Semantic-Transferable Weakly-Supervised Endoscopic Lesions Segmentation

08/21/2019 ∙ by Jiahua Dong, et al. ∙ 0

Weakly-supervised learning under image-level label supervision has been widely applied to the semantic segmentation of medical lesion regions. However, 1) most existing models rely on effective constraints to explore the internal representation of lesions, which produces only inaccurate and coarse lesion regions; 2) they ignore the strong probabilistic dependencies between the target lesion dataset (e.g., enteroscopy images) and a well-annotated source disease dataset (e.g., gastroscope images). To better utilize these dependencies, we present a new semantic lesion representation transfer model for weakly-supervised endoscopic lesion segmentation, which exploits useful knowledge from a relevant fully-labeled disease segmentation task to enhance the performance of the target weakly-labeled lesion segmentation task. More specifically, a pseudo label generator is proposed to leverage seed information to generate highly-confident pseudo pixel labels by incorporating class balance and a super-pixel spatial prior. It can iteratively include more hard-to-transfer samples from the weakly-labeled target dataset into the training set. Afterwards, dynamically searched feature centroids for the same class among different datasets are aligned by accumulating previously-learned features. Meanwhile, adversarial learning is also employed to narrow the gap between the lesions of different datasets in the output space. Finally, we build a new medical endoscopic dataset with 3659 images collected from more than 1100 volunteers. Extensive experiments on our collected dataset and several benchmark datasets validate the effectiveness of our model.




1 Introduction

Weakly-supervised learning [19, 38] focuses on learning a pixel-level lesion segmentation model for medical images from only weakly-labeled (image-level) annotations. Because it does not require large-scale, high-quality fully-labeled (pixel-level) annotations, it has been widely explored in a number of medical diagnosis tasks, e.g., automated glaucoma detection [43], thoracic disease localization [39] and histopathology segmentation [19].

Figure 1: Demonstration of our semantic lesion representation transfer model, where the left and right images are from gastroscope and enteroscopy datasets, respectively. Our model learns the semantic transferable knowledge from source data to target data via pseudo pixel-label and dynamically-searched feature centroids (i.e., different shapes) of each class.

However, weakly-supervised learning remains a major challenge for semantic lesion segmentation, since 1) effective constraints or domain expertise are needed to learn the internal representation related to image-level annotations, which tends to produce inaccurate and coarse lesion regions; 2) it ignores the strong probabilistic dependencies between the target lesion segmentation task and a well-annotated source disease dataset, where such dependencies can be treated as semantic knowledge. For example, diseases detected by gastroscopy and enteroscopy tend to share similar appearances and, further, have similar prior distributions. Based on such dependencies, in this paper we explore how to transfer semantic knowledge from a closely-related fully-annotated source dataset (e.g., gastroscope images) to a weakly-labeled target dataset (e.g., enteroscopy images).

To take advantage of this transferable semantic knowledge, we propose a new weakly-supervised semantic lesion representation transfer model, as shown in Figure 1. Its goal is to learn transferable semantic knowledge from a fully-labeled source disease dataset in order to improve segmentation performance on the weakly-labeled target lesion segmentation task. The core of our model is a pseudo pixel-label generator, which leverages seed information and incorporates class balance with a super-pixel prior [1] to prevent the dominance of well-to-transfer categories. Hard-to-transfer samples can thus be incrementally introduced from the target dataset into the training set. Afterwards, to mitigate the gap between the feature mappings of the same class in the source and target datasets, we learn transferable knowledge by aligning dynamically-searched feature centroids, which are gradually computed from previously-learned features and highly-confident pseudo labels. Meanwhile, adversarial learning is applied in the output space to drive the segmentation outputs of the source and target datasets to share a closer global distribution. Finally, we conduct experiments on our own medical endoscopic dataset and several benchmark datasets, and the results support the effectiveness of our proposed model.

The contributions of our work are as follows:

  • We develop a new semantic lesion representation transfer model for weakly-supervised lesion segmentation. To the best of our knowledge, this is an early exploration of semantic transfer for endoscopic lesion segmentation in the medical image analysis field.

  • A pseudo pixel label generator is proposed to progressively mine more highly-confident pseudo labels, which can not only include more hard-to-transfer samples from the target dataset into training set, but also achieve class balance with super-pixel priors.

  • A new medical endoscopic dataset with 3659 images collected from more than 1100 volunteers is built. We demonstrate the effectiveness of our model against several state-of-the-art methods on our endoscopic dataset and several benchmark datasets.

2 Related Work

In this section, we discuss some representative related works about semantic lesion segmentation and semantic representation transfer.

Semantic Lesion Segmentation: Computer-aided diagnosis (CAD) [31, 9, 37, 7] has been developed to assist clinicians in improving the efficiency and accuracy of medical lesion segmentation. Traditional methods rely on local image features handcrafted by domain experts [18, 6]. To further improve segmentation quality, most advanced methods [28, 20, 10] are based on convolutional neural networks [14, 32, 4]; they achieve state-of-the-art performance but require a large number of pixel-level annotations. Thus, weakly-supervised semantic lesion segmentation methods [19, 38] have been proposed to save annotation effort. However, there is still a large performance gap between models trained only with image tags and models trained with pixel annotations.

Semantic Knowledge Transfer: Learning a semantic transferable representation from a source dataset to a target dataset for classification via generative adversarial networks [13] has been widely explored [23, 34, 35, 24, 15]. As pointed out in [42], methods that address classification transfer do not translate well to semantic segmentation, which remains a significant challenge. Recently, Bousmalis et al. [2] propose to learn transferable knowledge by transferring the source images to the target dataset. [42] utilizes a curriculum learning approach to mitigate the gap between the source and target datasets. Several studies [16, 5, 15, 12, 33] apply adversarial learning in the feature space for semantic segmentation transfer. [17] introduces an additional generator conditioned on extra auxiliary information for the target dataset. [44] exploits a self-training strategy for semantic representation transfer. However, existing models cannot be directly applied to semantic lesion transfer, since 1) they cannot ensure that features of the same class in different datasets are mapped nearby, because no valid label information is available for the target samples; 2) they tend to transfer the easier-to-learn classes instead of balancing all classes.

Therefore, we focus on learning semantic transferable knowledge by highly-confident class-balanced pseudo labels and dynamically-searched feature centroids with previously-learned experience.

Figure 2: Framework of our proposed model. Its six components are a ResNet-50 network for feature extraction, adversarial learning for enforcing the lesion segmentation outputs to share a closer distribution, a pseudo label generator for the weakly-labeled enteroscopy dataset, a semantic representation transfer loss for aligning feature centroids between the source and target datasets, and two subnets for classification and segmentation, respectively.

3 The Proposed Model

In this section, we provide a brief overview about our semantic lesion representation transfer model. Then, the details about model formulation, training and testing procedures are elaborated.

3.1 Overview of Our Proposed Model

The overall architecture of our model is shown in Figure 2. Two subnets are designed for the classification and segmentation tasks, respectively, where the prediction of the segmentation subnet is refined by the classification probability via a convolution operation, as shown by the dashed arrows in Figure 2. The source dataset (e.g., gastroscope images) provides images with both image-level and pixel-level annotations, whereas the target dataset (e.g., enteroscopy images) provides images with image-level annotations only. We first forward a source image to optimize the whole network excluding the discriminator. The segmentation output for a target image is then predicted by the segmentation subnet. Since our goal is to encourage the segmentation outputs of the source and target datasets to share a closer distribution, the discriminator takes these two predictions as input and distinguishes whether each one comes from the source or the target dataset.

Although we employ a generative adversarial objective to narrow the gap between the segmentation outputs of the source and target datasets, it cannot ensure that the features of the same class in different datasets are mapped nearby. Motivated by this observation, we learn the semantic representation transfer by aligning the feature centroid of each class. However, no pixel annotations are available as guidance for computing the centroids of the target dataset. To address this issue, we propose a new method for generating pseudo pixel labels, which takes class balance and super-pixel segmentation priors into account. Based on the pseudo labels of the target dataset, we use exponentially-weighted features based on previously-learned experience to compute the semantic centroid of each class. Furthermore, each target image, together with its assigned pseudo pixel labels, is forwarded into our model to fine-tune the whole network.

3.2 Model Formulation

In order to learn transferable knowledge for the target disease segmentation task, we formulate our proposed model as an overall objective that combines the loss functions below, balanced by trade-off parameters. Each loss function is defined as follows:

Classification Loss: This term is the classification loss on both the source and target datasets (e.g., gastroscope and enteroscopy datasets). The classification subnet is used to decide whether the input image contains a lesion or not, and it is trained with the typical cross-entropy loss on its softmax outputs for the source and target images.

Segmentation Loss: For the segmentation subnet with softmax outputs, this term is the segmentation loss on the source dataset with supervised pixel annotations and on the target dataset with assigned pseudo pixel labels, as formulated in Eq. (3). For each pixel of a source image, the ground truth label is one-hot encoded over the classes; for each pixel of a target image, the assigned pseudo label is either a one-hot vector or the zero vector. Notice that assigning the zero vector neglects the corresponding pseudo pixel label in the training procedure. We thus expect the norm regularization on the pseudo labels to serve as a negative sparsity constraint that prevents the trivial solution of ignoring all pseudo pixel labels. A global weight controls the number of selected pseudo labels, and a larger weight promotes the selection of more pseudo labels for model training.

Similar to self-paced learning [21], Eq. (3) in our model can iteratively produce pseudo pixel labels corresponding to large confidence. However, the optimization of the second term in Eq. (3) can result in two issues: (i) our model will tend to be biased towards initially easily-learned classes and neglect other hard-to-transfer classes in the training procedure; (ii) the generated pseudo labels with highly-confident scores are spatially discrete. To address the issue (i), the second term in Eq. (3) can be formulated as Eq. (4) where class-wise confidence levels are normalized.


where the class balance parameters determine the proportion of generated pseudo labels for each class. To avert the dominance of classes with a large number of pixels, we develop a new method for determining these parameters, as summarized in Algorithm 1: after obtaining the maximum predicted probability of each pixel for all target images, we sort the probabilities of all pixels predicted as a given class, and the parameter of that class is then determined from the probability ranked at the selected portion of the sorted values. The portion starts from 25% and is empirically increased by 5% in each training epoch, up to a maximum of 55%. Furthermore, Eq. (4) has a closed-form optimal solution, given as Eq. (5).


To handle issue (ii), the pseudo labels produced by Eq. (5) are refined with super-pixel spatial priors [1], which ensures the spatial continuity of the generated pseudo labels. Algorithm 2 presents the details of how super-pixel spatial refinement is applied to the assignment of pseudo labels: the super-pixel priors are computed for each target image, and when a pixel has no valid pseudo label, its label is decided by voting over the pseudo labels of those pixels in its 8-neighborhood that share the same spatial prior.

Adversarial Loss: To drive the lesion segmentation outputs of the source and target datasets to share a similar distribution, we utilize a generative adversarial objective. The discriminator in Figure 2 takes the two segmentation softmax outputs of the segmentation subnet as input and distinguishes whether each input comes from the source or the target dataset, while the segmentation subnet is trained to fool the discriminator. Formally, the loss is defined in Eq. (6) in terms of the discriminator outputs for the source and target images and the parameters of the discriminator.

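Algorithm 1 Determination of the class balance parameters in Eq. (4)
Input: the segmentation subnet, the number of classes, the portion of selected pseudo labels, and the target images;
Output: the class balance parameter of each class;
  for each target image do
     Predict the segmentation softmax output of the image with the subnet;
     Record the maximum predicted probability of each pixel;
     Record the predicted class of each pixel;
     for each class do
        Collect the maximum probabilities of the pixels predicted as this class;
        Append them to the probabilities gathered for this class;
     end for
  end for
  for each class do
     Sort the gathered probabilities of this class;
     Determine the class balance parameter from the probability ranked at the selected portion;
  end for
  return the class balance parameters of all classes;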
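
The class-balanced selection summarized in Algorithm 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it works directly with per-class probability cutoffs rather than the class balance parameters of Eq. (4), each image is assumed to be a flat list of per-pixel softmax vectors, and the function names `selection_portion`, `class_thresholds` and `select_pseudo_labels` are hypothetical.

```python
import math


def selection_portion(epoch, start=0.25, step=0.05, cap=0.55):
    """Portion of pseudo labels kept at a 0-based epoch (25%, +5% per epoch, max 55%)."""
    return min(cap, start + step * epoch)


def class_thresholds(prob_maps, num_classes, portion):
    """Return one probability cutoff per class (None if the class is never predicted).

    prob_maps: list of images, each a list of per-pixel probability vectors.
    """
    per_class = [[] for _ in range(num_classes)]
    for image in prob_maps:
        for probs in image:
            best = max(range(num_classes), key=lambda c: probs[c])
            per_class[best].append(probs[best])  # max probability, grouped by class

    cutoffs = []
    for scores in per_class:
        if not scores:
            cutoffs.append(None)
            continue
        scores.sort(reverse=True)
        rank = max(1, math.ceil(len(scores) * portion))
        cutoffs.append(scores[rank - 1])  # probability ranked at the selected portion
    return cutoffs


def select_pseudo_labels(image, cutoffs):
    """Keep a pixel's predicted class only if it clears that class's own cutoff."""
    labels = []
    for probs in image:
        best = max(range(len(probs)), key=lambda c: probs[c])
        keep = cutoffs[best] is not None and probs[best] >= cutoffs[best]
        labels.append(best if keep else None)  # None marks an ignored pixel
    return labels
```

Because every class gets its own cutoff, a class whose predictions are all low-confidence still contributes its top-ranked pixels, which is the intent of the class balance described above.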

Semantic Transfer: To ensure that the features of the same class in the source and target datasets are mapped nearby, this loss performs semantic representation transfer via feature centroid alignment between the centroids of each class in the two datasets, as defined in Eq. (7), where a trade-off parameter balances its terms. Considering that the centroids of the same class in different datasets have similar sparsity properties, we include the second term of Eq. (7). Specifically, motivated by the exponential reward design in reinforcement learning [22, 25], we propose a new method to search the centroids of each class based on exponentially-weighted previously-learned features, which draws on the experience learned so far. Furthermore, the pseudo labels generated by Algorithm 2 are used to guide the semantic alignment for the target dataset. The details of computing the centroid of each class are shown in Algorithm 3.

Instead of directly aligning the newly obtained centroids in each iteration, we align the centroids by resorting to previously-learned experience, which overcomes two practical limitations: 1) the categorical information in each batch is often insufficient, e.g., some classes may be missing from the current training batch because the samples are randomly selected; 2) if the batch size is small, even one false pseudo label can lead to an enormous deviation between the true centroid and the pseudo-labeled centroid.
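
The two limitations above are what an exponentially-weighted running centroid is meant to absorb. The following plain-Python sketch illustrates the idea under stated assumptions: the mixing weight `momentum`, the squared Euclidean alignment term, and all function names are hypothetical choices for illustration, and the sparsity term of Eq. (7) is omitted.

```python
def batch_centroids(features, labels, num_classes):
    """Mean feature vector per class for one batch; None for classes absent from it."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, label in zip(features, labels):
        if label is None:  # pixel without a (pseudo) label is skipped
            continue
        counts[label] += 1
        for i, value in enumerate(feat):
            sums[label][i] += value
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def update_centroids(running, batch, momentum=0.7):
    """Blend the running centroids with the current batch centroids."""
    updated = []
    for old, new in zip(running, batch):
        if new is None:        # class missing in this batch: keep the history
            updated.append(old)
        elif old is None:      # first time the class is seen
            updated.append(new)
        else:
            updated.append([momentum * o + (1 - momentum) * n for o, n in zip(old, new)])
    return updated


def alignment_loss(source_centroids, target_centroids):
    """Sum of squared distances between same-class source and target centroids."""
    total = 0.0
    for src, tgt in zip(source_centroids, target_centroids):
        if src is None or tgt is None:
            continue
        total += sum((a - b) ** 2 for a, b in zip(src, tgt))
    return total
```

A class that is absent from the current batch simply keeps its accumulated centroid, and a single mislabeled pixel moves the running centroid by at most a `1 - momentum` fraction of the batch deviation.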

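Algorithm 2 Determination of Ultimate Pseudo Pixel Labels
Input: the enteroscopy images, the width and height of each image, and the number of classes;
Output: the pseudo labels;
  Solve the class balance parameters via Algorithm 1;
  for each enteroscopy image do
     Compute the initial pseudo labels via Eq. (5);
     Compute the super-pixel segmentation priors of the image;
     for each pixel of the image do
        Set the vote count of every class to zero;
        if the image has no pseudo label at this pixel then
           for each pixel in its 8-neighborhood with the same super-pixel prior do
              Add one vote for the pseudo label of that neighbor;
           end for
           Select the class with the most votes;
           if that class received at least one vote then
              Assign it as the pseudo label of this pixel;
           end if
        end if
     end for
     Return the ultimate pseudo labels for the image;
  end for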
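
The neighborhood voting of Algorithm 2 can be illustrated with a short plain-Python sketch. It is a simplified reading, not the authors' code: the super-pixel map `segments` is assumed to be precomputed (for example by SLIC [1]) and passed in as a 2D grid of segment ids, unlabeled pixels are represented by `None`, and the function name `refine_with_superpixels` is hypothetical.

```python
from collections import Counter


def refine_with_superpixels(labels, segments):
    """Fill unlabeled pixels by majority vote of labeled 8-neighbors in the same super-pixel.

    labels:   2D list of class ids, with None where no pseudo label was selected.
    segments: 2D list of super-pixel ids with the same shape as labels.
    """
    height, width = len(labels), len(labels[0])
    refined = [row[:] for row in labels]  # votes always read the initial labels
    for y in range(height):
        for x in range(width):
            if labels[y][x] is not None:
                continue
            votes = Counter()
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if (dy, dx) == (0, 0) or not (0 <= ny < height and 0 <= nx < width):
                        continue
                    same_segment = segments[ny][nx] == segments[y][x]
                    if same_segment and labels[ny][nx] is not None:
                        votes[labels[ny][nx]] += 1
            if votes:  # leave the pixel unlabeled when no neighbor can vote
                refined[y][x] = votes.most_common(1)[0][0]
    return refined
```

Reading votes from the initial labels rather than from the partially refined grid keeps the result independent of the scan order; propagating labels further would require repeated passes.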

3.3 Details of Network Architecture

Baseline and Subnets: We utilize the DeepLab-v3 [4] architecture based on ResNet-50 [14] as the backbone network, which is pre-trained on ImageNet [11]. For ResNet-50 [14], we remove the last classification layer and modify the stride of the last two convolutional blocks from 2 to 1 for a higher-resolution output. Moreover, three dilated convolutional filters with stride of {1, 2, 4} are utilized in the last convolutional block to enlarge the receptive field. As shown in Figure 2, the output feature map generated by the baseline ResNet-50 is passed into the classification subnet for image classification. It is also forwarded into the segmentation subnet for pixel segmentation, which contains an Atrous Spatial Pyramid Pooling (ASPP) [3] block and a pixel classifier layer.

Discriminator (D): Inspired by [26], we employ a fully convolutional network for the discriminator, which retains global information, in contrast to a multi-layer perceptron. It consists of 5 convolutional layers with a stride of 2 and a kernel size of 3. In more detail, the channels of the 5 convolutional layers are {16, 32, 64, 64, 1}, respectively. Except for the last convolutional layer, each layer is followed by a Leaky ReLU activation with the parameter set to 0.2.
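
To make the downsampling of this discriminator concrete, the sketch below traces the feature-map shape through the five layers. It is an illustration only: the padding of 1 and the 512 x 512 input size in the usage note are assumptions that the text does not state, and the function names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_shapes(height, width, channels=(16, 32, 64, 64, 1)):
    """Return (channels, height, width) after each of the five stride-2 layers."""
    shapes = []
    for out_channels in channels:
        height = conv_output_size(height)
        width = conv_output_size(width)
        shapes.append((out_channels, height, width))
    return shapes
```

Under these assumptions, `discriminator_shapes(512, 512)` ends in a single-channel 16 x 16 map, so the discriminator scores local regions of the segmentation output rather than emitting one scalar per image.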

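Algorithm 3 Optimizing Semantic Representation Transfer Loss
Input: the maximum number of iterations, the number of classes, and the feature centroids of each class for the source and target datasets;
Output: the semantic representation transfer loss;
  for each iteration do
     Sample a batch of source images with their pixel annotations;
     Sample a batch of target images;
     Generate the pseudo labels of the target images via Algorithm 2;
     Extract the pixel feature maps of the source and target images by the subnet;
     for each class do
        Compute the current source centroid of this class from the source features;
        Compute the current target centroid of this class from the target features and pseudo labels;
        Update the source centroid with the current one; (Exponentially-weighted)
        Update the target centroid with the current one; (Exponentially-weighted)
     end for
     Return the semantic representation transfer loss;
  end for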
Metric            Baseline [4]  CDWS [19]  NMD [5]  Wild [16]  DFN [40]  LtA [33]  CGAN [17]  Ours
IoU, normal (%)   75.13         25.11      81.10    81.58      81.33     81.73     80.32      84.76
IoU, disease (%)  33.24         15.51      36.85    38.59      37.50     41.10     41.33      43.16
mIoU (%)          54.19         20.31      58.97    60.09      59.41     61.42     60.82      63.96
Table 1: Performance comparison between our proposed model and the state-of-the-art methods on our medical dataset. Models with the best performance are bolded.

3.4 Training and Testing

Training: In each training step, for the classification and segmentation losses, we first forward a source image (e.g., gastroscope) with its image-level label and pixel-level annotation to the network and generate the segmentation softmax output. We then obtain the softmax output for a target image (e.g., enteroscopy), which has only an image-level label, and generate its ultimate pseudo pixel labels via Algorithm 2. In addition, these two segmentation outputs are passed into the discriminator to optimize the adversarial loss. To train the semantic transfer objective, the centroids of each class for the source and target datasets are computed via Algorithm 3, which resorts to previously-learned features.

Testing: In the testing phase, a target image (e.g., enteroscopy) is passed into the ResNet-50 feature extractor, followed by the two subnets for classification and segmentation. The discriminator and the other algorithmic designs are not involved. As for implementation details, we use a single Titan XP GPU with 12 GB of memory. The Adam optimizer is used to train the whole network with a batch size of 4. The learning rate decays exponentially from its initial value, with a decay rate of 0.7 and a step size of 950.
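
The exponential learning-rate schedule with rate 0.7 and step size 950 can be written as a one-line function. The sketch below is an assumption-laden illustration: the initial learning rate is left as a required argument, and whether the decay is applied in discrete steps (`staircase=True`) or continuously is not stated, so both variants are offered.

```python
def learning_rate(step, initial_lr, decay_rate=0.7, decay_steps=950, staircase=True):
    """Exponentially decayed learning rate at a given training step."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent
```

With the staircase variant, the rate is multiplied by 0.7 once every 950 steps, so after 1900 steps it is 0.49 times the initial value.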

4 Experiments

In this section, we give a detailed description of our built dataset; both the source code and the dataset are made available. Although our model is mainly designed for medical image analysis, experiments on other benchmark datasets are also conducted to validate its generalization performance.

4.1 Dataset and Evaluation

The datasets in our experiments include our own medical dataset and three benchmark datasets.

Medical Endoscopic Dataset: This dataset is built by ourselves. It has a total of 3659 images collected from more than 1100 volunteers with various lesions, including gastritis, polyp, cancer, bleeding and ulcer. Specifically, it contains 2969 gastroscope images and 690 enteroscopy images. In the training phase, we treat the gastroscope images as the source dataset, in which 2400 images have image-level labels and 569 images have both image-level labels and pixel-level annotations; the enteroscopy images are treated as the target dataset, of which 300 images with image-level labels are used for training. In the test phase, the other 390 enteroscopy images are used to evaluate the performance.

Cityscapes [8] is a real-world dataset of urban street scenes collected in 50 cities. It consists of three disjoint subsets: a training subset with 2993 images, a validation subset with 503 images and a test subset with 1531 images. There are 34 distinct categories in the dataset in total.

GTA [27] contains 24996 images of synthetic street scenes, which are collected from the realistic computer game Grand Theft Auto V, whose setting is based on the city of Los Angeles. The segmentation annotations are compatible with the Cityscapes dataset [8].

SYNTHIA [29] is a large synthetic dataset whose images are collected in a virtual city that does not correspond to any real city. For the experiments, we use its subset called SYNTHIA-RANDCITYSCAPES with 9400 images, including 12 automatically annotated object categories and some unnamed classes.

For evaluation, we use intersection over union (IoU) as the basic metric. Three derived metrics are used: the IoU of the normal class, the IoU of the disease class and the mean IoU (mIoU). The larger the metric, the better the corresponding model.
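
As a reminder of how these metrics are computed, the following plain-Python sketch evaluates per-class IoU and mIoU on flattened label lists. It is a generic illustration rather than the evaluation script used in the experiments; the function names and the convention of returning NaN for a class that never occurs are assumptions.

```python
def class_iou(pred, truth, cls):
    """Intersection over union of one class for flattened prediction and ground truth."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else float("nan")


def mean_iou(pred, truth, classes):
    """Average of the per-class IoU values over the given classes."""
    values = [class_iou(pred, truth, c) for c in classes]
    return sum(values) / len(values)
```

For example, with `pred = [0, 0, 1, 1]` and `truth = [0, 1, 1, 1]`, class 0 has IoU 1/2, class 1 has IoU 2/3, and the mIoU is their average, 7/12. In practice the intersections and unions are usually accumulated over the whole test set before dividing.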

4.2 Experiments on Medical Endoscopic Dataset

In this subsection, we validate the superiority of our model by comparing it with several state-of-the-art methods on our built medical dataset:

  • Baseline (BL) model utilizes DeepLab-v3 [4] as backbone for segmentation without semantic transfer.

  • Constrained Deep Weak Supervision (CDWS) [19] exploits multi-scale learning with weak supervision by applying area constraint for segmentation predictions.

  • No More Discrimination (NMD) [5] refines segmentation module by leveraging soft pseudo labels and static object priors with multiple class-wise adaptation.

  • FCNs in the Wild (Wild) [16] designs an adversarial loss with prior constraint on pixel-level output to optimize intermediate convolutional layers.

  • Discriminative Feature Network (DFN) [40] designs both Smooth Network and Border Network to learn discriminative semantic feature.

  • Learning to Adapt (LtA) [33] exploits multi-level adaptation in the context of semantic segmentation.

  • Conditional GAN (CGAN) [17] proposes to integrate conditional GAN into the segmentation network for feature space adaptation.

For a fair comparison, we use ResNet-50 [14] as the backbone architecture and add an additional classification head to refine segmentation in this experiment. The evaluation results of our model against the state-of-the-art methods are presented in Table 1, from which we make the following observations: 1) Compared with the state-of-the-art methods [33, 17], our proposed model improves mIoU by a large margin of about 2.54% to 3.14%, which validates the effectiveness of our model, i.e., the pseudo label generator can mine more accurate and highly-confident pseudo labels. 2) As for mIoU, all models [5, 16, 40, 33, 17] with semantic transfer outperform the models [4, 19] without semantic transfer.

Figure 3: Visualization of the learned representations using t-SNE [36], where blue and red points are source gastroscope samples and target enteroscopy samples, respectively. Two separated clusters denote two categories, i.e., lesion and normal.
Metric            BL     BL+AL  BL+AL+PL  BL+AL+SRT  BL+PL+SRT  Ours   Ours-woPL  Ours-woCB  Ours-woSP
IoU, normal (%)   75.13  79.81  83.08     81.71      84.38      84.76  81.71      84.08      84.22
IoU, disease (%)  33.24  39.27  41.07     41.27      43.33      43.16  41.27      40.51      42.37
mIoU (%)          54.19  59.54  62.07     61.49      63.58      63.96  61.69      62.29      63.30
Table 2: Ablation study and different pseudo label designs of our model on the medical dataset, with the Baseline network DeepLab-v3 [4] (BL), Adversarial Learning (AL), Pseudo Labels (PL), Semantic Representation Transfer (SRT), and training without pseudo labels (Ours-woPL), class balance (Ours-woCB) or super-pixel spatial priors (Ours-woSP).

Ablation Study: To validate the effectiveness of the different components of our model, we also conduct ablation experiments on our medical dataset with the Baseline network DeepLab-v3 (BL), Adversarial Learning (AL), Pseudo Labels (PL) and Semantic Representation Transfer (SRT). As shown in Table 2, the performance degrades when one or more components are removed, e.g., mIoU decreases by 0.38% to 4.42% across the ablated variants (from BL+PL+SRT to BL+AL). In addition, we visualize the learned transferable representations in Figure 3. Notice that, compared with the Baseline (Figure 3 (a)) and Adversarial Learning (Figure 3 (b)), our model maps the features of the same class in different datasets closer together as learning proceeds, which validates that highly-confident pseudo pixel labels and previously-learned features can further improve the performance of enteroscopy lesion segmentation.

Effect of Pseudo Labels Selection: We study how different designs for pseudo label selection affect the performance of our model, i.e., training without pseudo labels (denoted as Ours-woPL), training without class balance (denoted as Ours-woCB) and training without super-pixel spatial priors (denoted as Ours-woSP). As shown in Table 2, the model with class balance only (Ours-woSP) improves over Ours-woPL, and the model trained with both class balance and super-pixel spatial priors (Ours) improves further. This observation indicates that the pseudo label component is designed reasonably. In addition, as depicted in Figure 4, the pseudo pixel label generator can iteratively generate more highly-confident pseudo pixel labels by incorporating class balance and the super-pixel spatial prior.

Figure 4: The illustration of intuitive propagation of pseudo labels, where input images are from enteroscopy dataset.

Effect of Hyper-Parameters: This subsection investigates the effect of the hyper-parameters of our model. As illustrated in Figure 5, their optimal values can be chosen empirically through extensive parameter experiments. Notice that the performance of our model is highly stable when the values of the different parameters are tuned. Moreover, the results also validate the importance of incorporating previously-learned features and the sparsity property of the medical endoscopic dataset.

Figure 5: The effect of the hyper-parameters on the medical endoscopic dataset, shown in two panels, (a) on the left and (b) on the right.
Method road sidewalk building wall fence pole light sign veg sky person rider car bus mbike bike mIoU
DF [41] 6.4 17.7 29.7 1.2 0.0 15.1 0.0 7.2 30.3 66.8 51.1 1.5 47.3 3.9 0.1 0.0 17.4
Wild [16] 11.5 19.6 30.8 4.4 0.0 20.3 0.1 11.7 42.3 68.7 51.2 3.8 54.0 3.2 0.2 0.6 20.2
CL [42] 65.2 26.1 74.9 0.1 0.5 10.7 3.7 3.0 76.1 70.6 47.1 8.2 43.2 20.7 0.7 13.1 29.0
NMD [5] 62.7 25.6 78.3 - - - 1.2 5.4 81.3 81.0 37.4 6.4 63.5 10.1 1.2 4.6 -
LSD [30] 80.1 29.1 77.5 2.8 0.4 26.8 11.1 18.0 78.1 76.7 48.2 15.2 70.5 17.4 8.7 16.7 36.1
LtA [33] 84.3 42.7 77.5 - - - 4.7 7.0 77.9 82.5 54.3 21.0 72.3 32.2 18.9 32.3 -
CGAN [17] 85.0 25.8 73.5 3.4 3.0 31.5 19.5 21.3 67.4 69.4 68.5 25.0 76.5 41.6 17.9 29.5 41.2
BL 22.5 15.4 74.1 9.2 0.1 24.6 6.6 11.7 75.0 82.0 56.5 18.7 34.0 19.7 17.1 18.5 30.4
BL+AL 74.4 30.5 75.8 13.2 0.2 19.7 4.4 4.9 78.2 82.7 44.4 16.0 63.2 33.3 13.5 26.2 36.3
BL+AL+PL 79.2 38.7 76.5 10.7 0.3 22.4 5.6 11.4 79.5 81.3 58.1 20.7 70.4 31.6 24.8 32.3 40.2
BL+AL+SRT 79.9 38.2 77.1 9.7 0.2 21.1 6.8 7.6 76.1 81.6 54.8 21.3 66.2 30.8 21.6 30.6 39.0
BL+PL+SRT 61.6 28.7 71.6 20.8 0.6 28.7 31.1 24.9 80.0 81.5 62.7 16.2 69.4 12.3 27.8 51.5 41.8
Ours-woSP 67.2 29.4 73.5 21.2 0.7 28.4 29.7 24.5 79.9 81.1 62.9 15.8 72.8 12.6 26.5 51.2 42.3
Ours 68.4 30.1 74.2 21.5 0.4 29.2 29.3 25.1 80.3 81.5 63.1 16.4 75.6 13.5 26.1 51.9 42.9
Table 3: Performance comparison of learning transferable knowledge from the SYNTHIA dataset to the Cityscapes dataset. Models with the best and runner-up performance are marked in red and blue, respectively.
Method road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train mbike bike mIoU
DF [41] 31.9 18.9 47.7 7.4 3.1 16.0 10.4 1.0 76.5 13.0 58.9 36.0 1.0 67.1 9.5 3.7 0.0 0.0 0.0 21.1
Wild [16] 70.4 32.4 62.1 14.9 5.4 10.9 14.2 2.7 79.2 21.3 64.6 44.1 4.2 70.4 8.0 7.3 0.0 3.5 0.0 27.1
CL [42] 74.9 22.0 71.7 6.0 11.9 8.4 16.3 11.1 75.7 11.3 66.5 38.0 9.3 55.2 18.8 18.9 0.0 16.8 14.6 28.9
CyCADA [15] 79.1 33.1 77.9 23.4 17.3 32.1 33.3 31.8 81.5 26.7 69.0 62.8 14.7 74.5 20.9 25.6 6.9 18.8 20.4 39.5
LSD [30] 88.0 30.5 78.6 25.2 23.5 16.7 23.5 11.6 78.7 27.2 71.9 51.3 19.5 80.4 19.8 18.3 0.9 20.8 18.4 37.1
LtA [33] 86.5 36.0 79.9 23.4 23.3 23.9 35.2 14.8 83.4 33.3 75.6 58.5 27.6 73.7 32.5 35.4 3.9 30.1 28.1 42.4
CGAN [17] 89.2 49.0 70.7 13.5 10.9 38.5 29.4 33.7 77.9 37.6 65.8 75.1 32.4 77.8 39.2 45.2 0.0 25.2 35.4 44.5
BL 80.2 6.4 74.8 8.8 17.2 17.5 30.5 17.7 75.0 14.1 57.9 56.2 27.3 64.1 29.7 24.1 4.7 27.6 33.4 35.1
BL+AL 86.3 32.2 79.8 22.0 22.2 27.1 33.5 20.1 80.3 21.5 75.5 59.0 25.4 73.1 28.0 32.2 5.4 27.3 31.5 41.2
BL+AL+PL 91.7 48.3 76.8 25.1 28.5 28.2 39.7 44.5 79.8 13.6 72.3 53.6 19.1 85.8 23.7 44.2 32.8 13.4 31.5 44.9
BL+AL+SRT 92.4 49.8 73.6 25.3 28.3 24.5 40.9 45.0 79.2 14.2 70.4 50.1 18.6 86.6 22.3 45.4 30.3 11.9 32.8 44.3
BL+PL+SRT 92.6 47.8 77.4 26.7 28.8 29.9 42.4 46.3 80.7 15.1 71.1 55.8 24.3 86.5 21.5 42.4 43.3 12.1 30.8 46.1
Ours-woSP 92.4 47.3 78.5 25.4 27.8 34.8 42.0 44.6 79.8 15.3 67.1 60.5 30.7 86.3 26.4 43.7 36.1 14.8 33.2 46.7
Ours 92.7 48.0 78.8 25.7 27.2 36.0 42.2 45.3 80.6 14.6 66.0 62.1 30.4 86.2 28.0 45.6 35.9 16.8 34.7 47.2
Table 4: Performance comparison of learning transferable representation from the GTA dataset to the Cityscapes dataset. Models with the best and runner-up performance are marked in red and blue, respectively.

4.3 Experiments on Benchmark Datasets

In this subsection, we conduct experiments on several benchmark datasets that have compatible annotations with each other to further justify the effectiveness of our model. For a fair comparison, we remove the classification head and adopt the same experimental data configuration as the competing methods [16, 42, 15, 30, 33, 17]. For the ablation studies shown in Table 3 and Table 4, BL, AL, PL and SRT indicate the baseline, adversarial learning, pseudo labels and semantic representation transfer components of our model, respectively, and Ours-woSP indicates training without super-pixel priors.

Transfer from SYNTHIA to Cityscapes: In this experiment, our model is used to learn transferable knowledge from SYNTHIA [29] to Cityscapes [8]. In the training phase, the SYNTHIA dataset with 9400 finely-annotated images is regarded as the source dataset, and the Cityscapes training subset with 2993 images, used without pixel labels, is regarded as the target dataset. For testing, we use the validation subset of Cityscapes with 500 images, which is disjoint from the training subset. Notice that we consider the 16 classes common to the two datasets: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, sky, person, rider, car, bus, motorbike and bike. From the results presented in Table 3, we can conclude that: 1) Our model outperforms the state-of-the-art methods [30, 17, 33] for the remaining classes in terms of mIoU, which verifies the effectiveness of our model; 2) The ablation studies of PL, SRT and SP also validate that these components are designed reasonably; 3) Although the appearances of the hard-to-transfer classes (e.g., wall, pole, motorbike and bike) are extremely different between these two datasets, our model still achieves comparable performance on them.

Transfer from GTA to Cityscapes: To learn transferable representations from GTA [27] to Cityscapes [8], GTA with 24996 finely-annotated images and the Cityscapes training subset with 2993 images (used without pixel labels) are treated as the source and target datasets, respectively. The Cityscapes validation subset with 500 images is used for evaluation. For the results presented in Table 4, we consider 19 shared classes: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorbike and bike. Notice that: 1) other semantic transfer models are easily biased towards easy-to-transfer classes (e.g., road, building, sky, vegetation and car), while our model achieves better performance on both initially hard-to-transfer and easy-to-transfer classes; 2) the ablation studies of PL, SRT and SP verify that previously-learned experience and pseudo labels play a significant role when compared with [15, 30, 33, 17].
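Tables 3 and 4 report per-class IoU together with its mean (mIoU). As a reminder of how this metric is typically computed, the following is a minimal pure-Python sketch, not the authors' evaluation code. The function names and the ignore label 255 (the usual Cityscapes convention) are assumptions made for illustration. The final lines check that the 19 per-class values in the "Ours" row of Table 4 average to the reported 47.2.

```python
def confusion_matrix(y_true, y_pred, num_classes, ignore_index=255):
    """Accumulate a num_classes x num_classes matrix; rows are ground truth, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        if t == ignore_index:  # skip pixels without a valid ground-truth label
            continue
        cm[t][p] += 1
    return cm


def per_class_iou(cm):
    """IoU_c = TP / (TP + FP + FN); None for classes that never occur."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else None)
    return ious


def mean_iou(ious):
    """Average IoU over the classes that actually occur."""
    valid = [v for v in ious if v is not None]
    return sum(valid) / len(valid) if valid else 0.0


# Toy example with 3 classes and one ignored pixel (flattened label maps).
y_true = [0, 0, 1, 1, 2, 2, 255]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
ious = per_class_iou(cm)   # [1/3, 2/3, 1/2]
miou = mean_iou(ious)      # 0.5

# Per-class IoU values from the "Ours" row of Table 4 (GTA to Cityscapes).
ours_gta = [92.7, 48.0, 78.8, 25.7, 27.2, 36.0, 42.2, 45.3, 80.6, 14.6,
            66.0, 62.1, 30.4, 86.2, 28.0, 45.6, 35.9, 16.8, 34.7]
table_miou = round(sum(ours_gta) / len(ours_gta), 1)  # matches the last column, 47.2
```

Because mIoU weights every class equally, weak results on rare, hard-to-transfer classes lower the score as much as weak results on frequent classes such as road or sky.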

5 Conclusion

In this paper, we explore a new semantic lesions representation transfer model for weakly-supervised endoscopic lesions segmentation. More specifically, a pseudo pixel label generator is presented to progressively mine more samples from the target data into the training set; it incorporates super-pixel priors and class balance to prevent the dominance of well-to-transfer categories. We also align the dynamically-searched feature centroids of each class across different datasets by accumulating previously-learned features. Experiments on our collected dataset and several benchmark datasets show the effectiveness and superiority of our model.
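To make the class-balance idea concrete, the sketch below selects pseudo labels with a separate confidence threshold per class, in the spirit of class-balanced self-training [44]. It is an illustrative assumption rather than the generator proposed in this paper: the super-pixel prior is omitted, and the function names, the `keep_fraction` parameter and the ignore label 255 are hypothetical.

```python
def class_balanced_thresholds(confidences, predictions, num_classes, keep_fraction):
    """Per class, return the confidence that keeps the top keep_fraction of pixels predicted as that class."""
    thresholds = []
    for c in range(num_classes):
        scores = sorted(
            (s for s, p in zip(confidences, predictions) if p == c),
            reverse=True,
        )
        if not scores:
            thresholds.append(1.0)  # class never predicted: nothing to select
            continue
        k = max(1, int(len(scores) * keep_fraction))
        thresholds.append(scores[k - 1])
    return thresholds


def select_pseudo_labels(confidences, predictions, thresholds, ignore_index=255):
    """Keep a predicted label only if its confidence reaches the threshold of its class."""
    return [
        p if s >= thresholds[p] else ignore_index
        for s, p in zip(confidences, predictions)
    ]


# Toy example: class 0 is easy to transfer (high confidence), class 1 is hard.
confidences = [0.99, 0.95, 0.90, 0.85, 0.60, 0.55]
predictions = [0, 0, 0, 0, 1, 1]

thresholds = class_balanced_thresholds(confidences, predictions, num_classes=2, keep_fraction=0.5)
labels = select_pseudo_labels(confidences, predictions, thresholds)

# A single global threshold would select only class-0 pixels.
global_labels = [p if s >= 0.9 else 255 for s, p in zip(confidences, predictions)]
```

In this toy case the per-class thresholds keep one pixel of the hard class, whereas the global threshold of 0.9 keeps none. Raising `keep_fraction` over successive rounds would admit progressively harder samples.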


We thank Prof. Yunsheng Yang from the Chinese PLA General Hospital for providing the medical endoscopic data.


  • [1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk (2012-11) SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11), pp. 2274–2282. ISSN 0162-8828.
  • [2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan (2017) Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks. In CVPR.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018-04) DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. ISSN 0162-8828, 2160-9292.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV.
  • [5] Y. Chen, W. Chen, Y. Chen, B. Tsai, Y. F. Wang, and M. Sun (2017-10) No More Discrimination: Cross City Adaptation of Road Scene Segmenters. In ICCV.
  • [6] J. Cheng, Y. Chou, C. Huang, Y. Chang, C. Tiu, K. Chen, and C. Chen (2010-06) Computer-aided US Diagnosis of Breast Lesions by Using Cell-based Contour Grouping. Radiology 255 (3), pp. 746–754. ISSN 0033-8419, 1527-1315.
  • [7] Y. Cong, S. Wang, J. Liu, J. Cao, Y. Yang, and J. Luo (2015) Deep sparse feature selection for computer aided endoscopy diagnosis. Pattern Recognition 48 (3), pp. 907–917. ISSN 0031-3203.
  • [8] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In CVPR.
  • [9] J. Cui, H. Yu, S. Chen, Y. Chen, and H. Liu (2019) Simultaneous estimation and segmentation from projection data in dynamic pet. Medical Physics 46 (3), pp. 1245–1259.
  • [10] A. Dasgupta and S. Singh (2017-04) A fully convolutional neural network based structured prediction approach towards the retinal vessel segmentation. In 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), Melbourne, Australia, pp. 248–251. ISBN 978-1-5090-1172-8.
  • [11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
  • [12] Q. Dou, C. Ouyang, C. Chen, H. Chen, and P. Heng (2018) Unsupervised cross-modality domain adaptation of convnets for biomedical image segmentations with adversarial loss. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, pp. 691–697. ISBN 978-0-9992411-2-7.
  • [13] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2672–2680.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. In CVPR. arXiv:1512.03385.
  • [15] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 1989–1998.
  • [16] J. Hoffman, D. Wang, F. Yu, and T. Darrell (2016) FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv:1612.02649.
  • [17] W. Hong, Z. Wang, M. Yang, and J. Yuan (2018-06) Conditional generative adversarial network for structured domain adaptation. In CVPR.
  • [18] K. Horsch, M. L. Giger, L. A. Venta, and C. J. Vyborny (2001-08) Automatic segmentation of breast lesions on ultrasound. Medical Physics 28 (8), pp. 1652–1659. ISSN 00942405.
  • [19] Z. Jia, X. Huang, E. I. Chang, and Y. Xu (2017-11) Constrained Deep Weak Supervision for Histopathology Image Segmentation. IEEE Transactions on Medical Imaging 36 (11), pp. 2376–2388. ISSN 0278-0062, 1558-254X.
  • [20] Z. Jiang, H. Zhang, Y. Wang, and S. Ko (2018-09) Retinal blood vessel segmentation using fully convolutional network with transfer learning. Computerized Medical Imaging and Graphics 68, pp. 1–15. ISSN 08956111.
  • [21] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pp. 1189–1197.
  • [22] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In ICLR.
  • [23] M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015-02) Learning Transferable Features with Deep Adaptation Networks. In ICML.
  • [24] M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 136–144. ISBN 978-1-5108-3881-9.
  • [25] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller (2013) Playing atari with deep reinforcement learning. In NIPS.
  • [26] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434.
  • [27] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for data: ground truth from computer games. In ECCV.
  • [28] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241. ISBN 978-3-319-24574-4.
  • [29] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016-06) The synthia dataset: a large collection of synthetic images for semantic segmentation of urban scenes. In CVPR.
  • [30] S. Sankaranarayanan, Y. Balaji, A. Jain, S. Lim, and R. Chellappa (2018-06) Learning from synthetic data: addressing domain shift for semantic segmentation. In CVPR.
  • [31] J. Shiraishi, Q. Li, D. Appelbaum, and K. Doi (2011-11) Computer-Aided Diagnosis and Artificial Intelligence in Clinical Imaging. Seminars in Nuclear Medicine 41 (6), pp. 449–462. ISSN 00012998.
  • [32] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. In CVPR. arXiv:1409.1556.
  • [33] Y. Tsai, W. Hung, S. Schulter, K. Sohn, M. Yang, and M. Chandraker (2018) Learning to adapt structured output space for semantic segmentation. In CVPR.
  • [34] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In ICCV.
  • [35] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In ICCV.
  • [36] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
  • [37] S. Wang, Y. Cong, H. Fan, L. Liu, X. Li, Y. Yang, Y. Tang, H. Zhao, and H. Yu (2016-11) Computer-aided endoscopic diagnosis without human-specific labeling. IEEE Transactions on Biomedical Engineering 63 (11), pp. 2347–2358. ISSN 0018-9294.
  • [38] Y. Xu, J. Zhu, E. I. Chang, M. Lai, and Z. Tu (2014-04) Weakly supervised histopathology cancer image segmentation and classification. Medical Image Analysis 18 (3), pp. 591–604. ISSN 13618415.
  • [39] C. Yan, J. Yao, R. Li, Z. Xu, and J. Huang (2018) Weakly supervised deep learning for thoracic disease classification and localization on chest x-rays. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 103–110.
  • [40] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) Learning a discriminative feature network for semantic segmentation. CoRR abs/1804.09337.
  • [41] F. Yu and V. Koltun (2016) Multi-scale context aggregation by dilated convolutions. In ICLR.
  • [42] Y. Zhang, P. David, and B. Gong (2017-10) Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV.
  • [43] R. Zhao, W. Liao, B. Zou, Z. Chen, and S. Li (2019) Weakly-supervised simultaneous evidence identification and segmentation for automated glaucoma diagnosis.
  • [44] Y. Zou, Z. Yu, B.V.K. Vijaya Kumar, and J. Wang (2018-09) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In The European Conference on Computer Vision (ECCV).