Multi-level colonoscopy malignant tissue detection with adversarial CAC-UNet

06/29/2020 ∙ by Chuang Zhu, et al. ∙ 0

The automatic and objective medical diagnostic model can be valuable to achieve early cancer detection, and thus reducing the mortality rate. In this paper, we propose a highly efficient multi-level malignant tissue detection through the designed adversarial CAC-UNet. A patch-level model with a pre-prediction strategy and a malignancy area guided label smoothing is adopted to remove the negative WSIs, with which to lower the risk of false positive detection. For the selected key patches by multi-model ensemble, an adversarial context-aware and appearance consistency UNet (CAC-UNet) is designed to achieve robust segmentation. In CAC-UNet, mirror designed discriminators are able to seamlessly fuse the whole feature maps of the skillfully designed powerful backbone network without any information loss. Besides, a mask prior is further added to guide the accurate segmentation mask prediction through an extra mask-domain discriminator. The proposed scheme achieves the best results in MICCAI DigestPath2019 challenge on colonoscopy tissue segmentation and classification task. The full implementation details and the trained models are available at



There are no comments yet.


page 2

page 4

page 5

page 11

page 13

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Digestive system cancers cause major public health problems and lead to high mortality rate worldwide song2019targeting. Colorectal cancer and gastric cancer are the leading cause of digestive cancer mortality according to International Agency for Research on Cancer and American Cancer Society bray2018global; siegel2019cancer.

Motivation. It is evident that the early stage diagnosis and treatment will significantly increase treatment success and thus reduce the mortality rate nazeri2018two. Pathological checking is the golden standard for diagnosing these digestive system cancers. Generally, the pathological glass slides are made by the materials obtained in the operating room which are processed by formalin. To make the nuclei and cytoplasm visible, the slides are then dyed with hematoxylin and eosin (H & E) guo2018breast. During diagnosing phase, the specialists examine the glass slides under a microscope directly or check the generated digital pathology, such as the high-resolution whole slide image (WSI). The digital pathology based examination of WSI is becoming increasingly popular in recent years. Based on the observed features of the tissues, the pathological diagnostic results are then formed.

However, pathological diagnosis is subjective with a high inter-rater variance

gopinath2015development. Besides, the experienced pathologists who are qualified for accurate diagnosis based on WSIs are scarce, and manual analysis of WSI is a time-consuming task for the pathologists due to the large size of WSI (e.g. 100000 100000) challenge-grand-2019. Thus, an automatic and objective pathological WSI diagnostic model can be valuable to achieve early cancer detection and diagnosis.

Related Work. A variety of approaches have been developed to conduct automatic diagnosis based on pathological WSIs huang2011time; mercan2016localization; roullier2010graph; cruz2014automatic

. Due to the large size of the WSI, the direct use of the entire image as the input of the machine learning algorithms is impossible because of the great memory usage requirement

samsi2012efficient. Related solutions include, downsampling and region of interest (RoI) detection huang2011time; mercan2016localization, multi-resolution analyzing roullier2010graph, and extracting image patches cruz2014automatic; korbar2017deep; sharma2017deep; mejbri2019deep.

To alleviate the computing complexity, Huang et al. huang2011time downsampled WSIs first and then detected the RoI at the low-resolution level. The authors in work mercan2016localization

proposed another diagnostically relevant RoIs location approach based on color and texture features. The produced probability maps can achieve 74% overlap with the actual regions at which pathologists looked. Roullier

et al. proposed a highly efficient graph-based multi-resolution approach for mitosis extraction in breast cancer histological WSIs roullier2010graph. They processed each resolution level with the focus of attention resulting from a coarser resolution level analysis, and the proposed segmentation was fully unsupervised by just considering domain-specific knowledge.

The above downsampling and multi-resolution methods can alleviate the computing complexity through coarse WSI generation. However, this kind of method will introduce information loss due to the utilized coarse WSI, and thus ruin the diagnosis results. To solve this problem, many works perform patch splitting method to process WSI cruz2014automatic; tao2019highly; korbar2017deep; sharma2017deep, of which the WSI is first divided into patches and then processed by the automatic algorithms one by one. Cruz-Roa et al. cruz2014automatic first cropped the WSIs into non-overlapping image patches of

pixels via grid sampling. Then the author proposed to use a 3-layer CNN architecture to classify the extracted patches, and based on all the patch classification results to generate the final probability map for each WSI. To improve the analysis performance, the authors in work

korbar2017deep and work sharma2017deep proposed to adopt larger patch sizes and use more complex CNN models to recognize each patch. However, it is inaccurate to just combine the patch-level classification results to analyze the whole WSI tissue. To further improve the result, semantic segmentation should be performed for each patch tao2019highly.

Traditional semantic segmentation methods ilea2011image, preetha2012image conduct image segmentation based on the hand-craft features. Although these methods can achieve satisfactory performance, the design of the hand-crafted features is based on complex domain knowledge and the ability of the segmentation model is insufficient. In the past several years, Fully Convolutional Networks (FCN) based method long2015fully was proved can achieve decent semantic segmentation by modifying fully connected layers into convolution layers in CNN. The other improved state-of-the-art FCNs, such as U-Net ronneberger2015u; mejbri2019deep and SkipDeconv-Net (SD-Net) roy2017error, have shown great power in segmentation tasks. According to the specific requirements of different segmentation tasks, the recent semantic segmentation works, such as fu2019stacked, yin2019pm and wang2019dual, try to further improve segmentation performance using the stacked deconvolutional network (SDN), pyramid multi-label network (PM-Net), and dual encoding U-Net (DEU-Net), respectively. However, automated segmentation of malignant lesion is very challenging due to high variations in appearance, especially when the patches are extracted from different WSIs scanned with different equipment or parameters, as shown in Fig. 1. Thus, the direct use of FCNs to conduct segmentation is insufficient.

Figure 1: Visualization of selected lesion patches from 3 different WSIs. These patches have high variations in appearance, such as lesion structure and stain style.

Generally, the above appearance variations mean the dataset contains images with different data distributions. Most of the existing works address appearance variations by domain adaptation (DA) technique which treats the different data distributions as different domains pan2010survey. Many DA studies tsai2019domain; zou2019confidence are performed on pixel-level road-scene semantic segmentation, such as the synthetic-to-real (GTA5 richter2016playing to Cityscapes cordts2016cityscapes). In medical image processing, there are two kinds of DA solutions: pre-processing and domain-adversarial networks. The first kind of method, such as work Macenko2009A which normalized the stain while retaining the structure, and work bentaieb2018adversarial which proposed a discriminative image analysis model for stain standardization, just can alleviate the stain difference of the input image. To achieve more robust DA performance, many works propose to use domain-adversarial networks to impose constraints on the backbone network, and thus the backbone network can learn domain-invariant features lafarge2017domain; yang2018Generalizing; dou2018unsupervised; kamnitsas2017unsupervised. Lafarge et al. proposed a method based on domain-adversarial networks to remove the domain information from the model lafarge2017domain. Yang et al. proposed a novel online adversarial appearance conversion solution to explore a composite appearance and structure constraints yang2018Generalizing. Dou et al. proposed an unsupervised domain adaptation framework with a domain adaptation module (DAM) and a domain critic module (DCM) dou2018unsupervised. However, the above methods only learn a single layer’s domain-invariant feature, such as the last layer of the backbone network, and many feature maps of the network are ignored. The recent work kamnitsas2017unsupervised concatenates the multi-layer cropped feature maps and passes them to a domain-adversarial discriminator. However, it is not conducive to the discriminator for classification due to the huge number of channels and the information loss introduced by feature cropping. The recent researches conduct DA on different types of data, such as from the WSIs to microscopy images (MSIs) zhang2019whole, or use pseudo-labeling for cross-modality microscopy image xing2019adversarial. Most of these studies try to achieve unsupervised domain adaptation where the ground truth labels of the target domain are hard to obtain. In this paper, we focus on the supervised (the target domain labels are available) domain adaptation within the same data type (from WSIs to WSIs) but with different styles, such as the lesion structure and stain style.

Problems. Two related problems are denoted as follows.

Problem1: The existing directly applying patch-level segmentation to each WSI suffers the risk of false positive area detection. The WSIs which do not contain any malignant areas should be discarded before performing fine-grained segmentation. After WSI-level classification, key patches should be further selected from the malignant WSI with the similar reason: the patch without any malignant areas should not be processed by the segmentation model at all. Besides, to train a patch-level classification model, a set of training patches with ground truth labels should be generated first. However, directly label each patch containing malignant tissue as positive sample is inaccurate because different patches contain different sizes of malignant areas.

Problem2: Given the selected key patches, how to design an appearance invariant image segmentation is still very challenging. The existing segmentation model, such as UNet, is lack of the ability of appearance invariant. The recent DA solutions are suffered from feature information loss due to cropping.

The Digestive-System Pathological Detection and Segmentation Challenge 2019 (DigestPath2019), which is part of the MICCAI 2019 Grand Pathology Challenge, set up a task for evaluating automatic algorithms on colonoscopy tissue screening from digestive system pathological images challenge-grand-2019. This challenge provides a good platform to verify the above two issues. Besides, to conduct fair competition, the challenge requires each algorithm to execute on a single GPU and the average execution time on the test set can not exceed 120 seconds. This requires the designed scheme should take both accuracy and complexity into consideration.

Approach and Contributions. In this paper, we proposed a multi-level colonoscopy malignant tissue detection incorporated with domain adaptive segmentation scheme. The multi-level architecture is adopted to realize the malignant tissue detection in a coarse to fine manner, achieving lowering the risk of false positive detection while alleviating the high computing complexity at the same time. The domain adaptive segmentation is built to address the problem of appearance variations and thus boost the segmentation performance. We evaluated our method on the challenge dataset of the MICCAI 2019 challenge on lesion segmentation. Experimental results showed that our algorithm can achieve better result than other competitors. The main contributions of this paper are summarized as follows:

  • We proposed a highly efficient multi-level malignant tissue detection architecture. In our architecture, the WSI-level classification is performed based on a patch-level classifier with a pre-prediction scheme. The WSIs without any malignant areas are dropped and thus the computing time is saved. For the selected positive WSIs, multiple patch-level models are trained with skillfully selected samples and then integrated together to choose the key patches. Besides, a malignant area ratio guided label smoothing scheme is applied to further increase the model accuracy.

  • We proposed an adversarial context-aware and appearance consistency (CAC-UNet) model to achieve robust appearance-invariant segmentation. Mirror designed discriminators are able to seamlessly fuse the whole feature maps of the generator without any information loss. The mask prior is further added to guide the accurate segmentation mask prediction through an extra mask-domain discriminator. Besides, several powerful strategies are integrated into the backbone of CAC-UNet to further improve the appearance-invariant ability.

  • The proposed scheme achieved the highest dice similarity coefficient (DSC) and area under the curve (AUC) score on the dataset of MICCAI 2019 challenge on colonoscopy tissue segmentation and classification task.

Outline. The paper is organized as follows. In Section 3, we present the proposed multi-level lesion detection architecture and the domain adaptive segmentation model. Then, Section 4 reports the implementation, and analyses the experimental results. Finally, the discussion and conclusion of this paper are summarized in Section 5 and Section 6.

2 Preliminaries

In this section, we clarify some related concepts and definitions, and introduce the basic knowledge about generative adversarial networks (GAN) that is important to the proposed method.

Definitions and Concepts. It is known that a WSI is very huge, and generally it is first cropped into patches for further processing. In this paper, the boldface uppercase letter denotes a WSI, and boldface lowercase letter denotes a patch. The boldface lowercase letter and denote the predicted segmentation mask and ground truth mask for patch , respectively. Besides, the boldface uppercase letter and denote a set of predicted and ground truth segmentation masks. The WSIs or patches that contain malignant area (with ground truth label 1) are denoted as positive WSIs or patches, otherwise they are denoted as negative samples.

A domain refers to a dataset with a specific distribution, and different domains generally have images with different texture and appearances. According to work pan2010survey, a domain consists of two components: a feature space

and a marginal probability distribution



The framework of GAN is proposed for estimating generative models via an adversarial process

goodfellow2014generative. The target is to learn the generator’s distribution over data x. To achieve this, in the GAN model a generator and a discriminator are defined. For a prior on input noise variables , the generator maps the variable to a generated data space . denotes that x come from the real data rather than the generated one. The discriminator is trained targeting to tell apart real from fake input data and the generator is optimized to generate input data from the noise that “fools” the discriminator bentaieb2017adversarial. Through the adversarial training mechanism, both the discriminator and the generator are then optimized. To summarize, generator and discriminator play the two-player minimax game with value function ,


3 Proposed Method

In this section, we will first give the proposed multi-level detection architecture, and then introduce the WSI-level classification based on a patch-level model. After that, we will talk about the key patch selection and briefly explain the motivation of re-training the patch-level model in this stage. Based on the selected key patches, the domain adaptive segmentation will be discussed in detail to achieve the finest level detection.

3.1 Multi-level detection architecture

Figure 2: An overview of the proposed WSI automatic multi-level detection architecture. A WSI is first processed by RoI detection module, and thus the background or the other unimportant areas are removed. For the detected RoI area, image cropping is performed and a series of patches are produced. The cropped patches are recognized and combined to conduct WSI-level classification. The selected positive WSI is then cropped and recognized again to choose the key patches. Finally, the selected positive patches are processed by adversarial CAC-UNet.

The architecture of the proposed WSI automatic processing system is schematized in Fig. 2. The proposed multi-level architecture includes three main stages: Stage-1 WSI-level classification, judging whether the input WSI is benign or malignant, and discarding the negative WSIs; Stage-2 Key patch selection, finely classifying each patch in the malignant WSIs and choosing the positive ones as the key patches; Stage-3 Segmenting the key patches and stitching them into a complete WSI mask. We adopt DenseNet huang2017densely model for WSI-level classification and multi-model voting for patch-level classification. For the patch segmentation, we designed adversarial CAC-UNet to realize high accuracy segmentation.

3.2 WSI classification

In this part, we propose to use a patch-level model to conduct WSI classification. We first classify all the patches cropped from the important areas of WSI, and get a set of classification results. If the patch classified as positive accounts for more than a certain percentage of all patches, we infer that the WSI is positive. We then use the average of all positive or negative patches’ scores as the score for this WSI, as shown in Fig. 3.

Figure 3: WSI-level classification flow. The cropped patches are processed by DenseNet and the predicted probability map is generated. Guided by the WSI pre-prediction, the final WSI-classification score is produced by averaging selected patch-level results.

Before performing WSI-level classification, we need to remove the irrelevant background areas. A simple patch based RoI detection is applied: if the standard deviation of RGB values is less than a pre-defined threshold

, the current patch will be discarded. After RoI detection, most part of the background area is removed, and a visualized RoI detection of a WSI is shown in Fig. 4.

Figure 4: Visualization of RoI for a selected WSI: (a) the original WSI, (b) the selected RoI areas.

Let be a WSI after RoI detection, and be a cropped patch from . We propose to perform WSI-level classification.


where and are two patch-level sets, which include all the positive patches and negative patches of , respectively; and represent the pre-predicted positive and negative label, respectively;

denotes the average function. We use DenseNet-161 with ImageNet pre-trained parameters

huang2017densely as the classifier to decide whether a patch belongs to or . Based on the classification results of all the patches, we then pre-predict whether or . Then we can obtain the WSI classification score according to (2). Specifically, is decided as belonging to , if ( is a threshold with constant value). Similarly, belongs to , if . Then, we introduce (3) to pre-predict the label of . Note that the predicted label by (3) is a temporary intermediate result used to assist the generation of WSI classification score; the final WSI classification score is produced by (2).


where and denote the patch number in set and , respectively; is a threshold. In the following, we detail the training of our patch model adopted in this stage.

To train the patch-level model used in this stage, we first use a sliding window (stride=512, size=

) to crop WSI images. We then perform online data augmentations to these cropped patches, which include random folds, random brightness contrast, and grid distortion. We sampled 50% positive patches from positive WSIs and 50% negative patches from negative WSIs as training data. Due to the cropping, the malignant area varies in different patches. Directly assign each patch containing malignant area with label 1 is unreasonable. To address this problem, we used label smoothing as introduced in work szegedy2016rethinking. For the training example with ground-truth label y, the original label distribution is denoted as


where and are the training example and the corresponding label (malignant: , benign: ); is Dirac delta, which equals 1 for k = y and 0 otherwise. Similarly, we replace the original label distribution as


where is a smoothing parameter and is a distribution over labels. Different with work szegedy2016rethinking

which selected uniform distribution for

, in this paper we choose as


where (as shown in Fig. 5) and are the malignant area of current patch and maximum malignant area of all the patches; represents the ratio of the malignant area in a patch to the maximum malignant area of all patches.

Thus, in our work, the label distribution is written as


Based on (7), we will change the ground-truth label distribution according to the malignant ratio . Take an malignant patch (y = 1) for example, if this patch has big , we will encourage it to be confident with the ground truth label y; if the patch has small , which means the malignant patch having many benign areas, thus we should encourage it to be less confident with the ground truth label y.

Figure 5: Illustration of different areas of a patch: and represents the malignant and benign areas, respectively.

3.3 Key Patch Selection

In this stage (Stage-2), the target is to find all the positive patches in the WSI if it is judged as positive. Note that we do not directly use the patch model trained in Stage-1.

In Fig. 6 (a) to Fig. 6 (c), three patches are visualized: a positive patch from the positive WSI, a negative patch from the positive WSI, and a patch from the negative WSI. As denoted by the figure, the glandular structure of the malignancy patch, Fig. 6 (a), appears in heterogeneous shapes, but the benign patch, Fig. 6 (c), have a typical and uniform glandular arrangement. However, we should note that the glandular structure of the benign patch from the positive WSI, Fig. 6 (b), seems very different from the benign patch from the negative WSI, Fig. 6 (c). In fact, the benign patches from the negative WSI are more like normal lesion. In Stage-2, our target is to extract all the malignancy patches like Fig. 6 (a) from the positive WSIs, and the patches like Fig. 6 (b) can not provide useful information to assist the classification. Thus, the patches from negative WSIs are not used for training in this stage. We remind that in Stage-1, our aim is to distinguish the positive WSIs from the negative WSIs based on a patch-level model. In order to achieve more effective classification, the positive patches from the positive WSIs and patches from the negative WSIs are selected as the important information for positive WSIs and negative WSIs, respectively. The comparison of training strategies for Stage-1 and Stage-2 are summarized as Fig. 6 (d).

In work zhu2019breast, the authors proposed to assemble multiple hybrid models with the same architecture to reduce generalization error and improve performance. Different from work zhu2019breast, we apply different state-of-the-art CNN models predicting together. DenseNet connects each layer to every other layer in a feed-forward fashion, and thus it can be deeper and more accurate huang2017densely. ResNext xie2017aggregated is able to improve the classification accuracy by increasing the number of repeated basic building block that aggregates a set of transformations with the same topology. ResNet adopts a residual learning framework to ease the training of networks, which can make the networks substantially deeper than the previous models he2016deep. These three models are designed based on different ideas and they have some degree of complementary features. Based on these three models, we perform multi-model voting scheme to conduct the final classification by averaging the predicted scores of different models, as shown in Fig. 6 (e). DenseNet161, ResNet101, and ResNext101 are chosen in this work.

Figure 6: (a) a malignant patch from positive WSI, (b) a benign patch from positive WSI, (c) a benign patch from negative WSI, (d) comparison of different training strategies for Stage-1 and Stage-2, (e) multi-model voting scheme.

3.4 Segmentation

Our goal is to extract the pixel-level segmentation mask for the selected key patches and thus build the whole lesion segmenting result for the WSI. In this part, we first present related definitions and then give our proposed adversarial CAC-UNet architecture. After that, the backbone of CAC-UNet and the adversarial learning are detailed.

3.4.1 Definitions in Segmentation

To learn domain-invariant features and increase the generalization ability of our model, we split the training set into and subsets through clustering according to different texture and appearances of the WSIs, such as gland structure, stain style, and lesion distribution. For convenience, we use and representing two domains corresponding to subset and . We further define another two domains and to denote the model predicted segmentation masks and the expert labeled ground truth masks. We make a hypothesis that the expert labeled ground truth is often with a smooth and continuous boundary. If we put this prior constraint into the model training, the accuracy of the predicted masks will thus be improved.

To realize the domain-invariant feature learning, our model is built on the foundations of GAN. In our work, the discriminators refer to a series of classifiers based on CNNs, and the generator means the entire or part of the segmentation model, which generates some feature maps or segmenting masks as the input to different discriminators.

3.4.2 Architecture of the Proposed Adversarial CAC-UNet

In Fig. 7, the architecture of the proposed adversarial CAC-UNet is presented. Our architecture is composed of the main segmentation network, backbone of CAC-UNet, and three discriminators, , and . Note that the backbone of segmentation network CAC-UNet plays the role of generator .

The backbone of CAC-UNet is constructed based on the basic UNet ronneberger2015u, and consists of an encoder and a decoder part. The encoder takes the image as the input and maps it into feature maps. The decoder takes these feature maps as the input and transforms them to segmentation mask, which will be compared with the ground truth mask. The discriminator takes the encoder feature maps as the input and then combines them with the feature maps generated by itself to decide whether the input image belongs to domain or . Similarly, the discriminator combines the decoder feature maps and the feature maps generated by to conduct the same decision process. The discriminator takes both the predicted segmentation mask and the ground truth mask as the input to recognize whether the mask is generated by the model ( domain) or the expert ( domain).

In summarize, the optimization targets of our model are to: (1) Minimize the differences between the predicted segmentation masks and the ground truth segmentation masks; (2) discriminate images from domain from domain based on the encoder feature maps; (3) discriminate images from domain from domain

based on the decoder feature maps; (4) discriminate the predicted masks from ground truth masks. To achieve these four targets, the proposed final adversarial training optimization loss function is denoted as


where , , are trade-off parameters adjusting the importance of each term.

Figure 7: Overview of our proposed adversarial CAC-UNet framework. The backbone of CAC-UNet is the generator and serves the image segmentation. The discriminators ,, differentiate their inputs accordingly and thus generate adversarial losses.

3.4.3 Backbone of CAC-UNet

In the basic UNet ronneberger2015u, the skipped encoder feature maps and the up-sampled decoder feature maps are concatenated to perform segmentation by recovering the full spatial resolution at the model output. However, the disadvantages are obvious when processing histopathology images: the lack of the context aware ability and suffering context information loss, inability in handling the appearance inconsistency. We design the backbone of our CAC-UNet architecture targets on more powerful context aware and appearance invariant ability. The key details are highlighted as follows.

Figure 8: The backbone structure of our CAC-UNet. Top left: backbone segmentation structure; top right: decoder module; middle: PPM; bottom left: atrous convolution; bottom right: Res Block with IN in the decoder module.

Encoder Selection. We have tried different more powerful encoders. We chose the widely used ResNet and its improved ResNext as the encoder, and we tried models with different depths, such as 34, 50, 101. Through experiments, it is found that ResNet50 can achieve better performance on the validation set. While larger networks such as ResNet101 and ResNext101 have no obvious advantages, but there is a risk of over-fitting. Therefore, we choose ResNet50 as the encoder of the segmentation network.

Context-aware Design. To enhance the context-aware ability, we implemented two techniques proposed by work roy2018concurrent and zhao2017pyramid into the UNet: the Spatial-Channel Sequeeze & Excitation (SCSE) block and the Pyramid pooling module (PPM).

SCSE block is able to adjust the weighting of different network feature maps according to their importance: attach a higher weight to important feature maps or feature channels and attach small weight to reduces the influence of unimportant features. In our design, we integrate the SCSE attention block to the decoder part for realizing the context aware.

In UNet, the input image is encoded as a multi-channel (1024) feature maps through the processing of the encoder, and then the decoder transforms the feature maps to the final segmenting mask by a series of up-sampling and skip connected feature concatenation. The encoded multi-channel feature map contains most part of the information used for segmentation. However, the single scale will inevitably introduce context information loss. Motivated by work zhao2017pyramid, we put hierarchical PPM in the center of the network to aggregate more global information, which embeds information with different scales and varying among different sub-regions. Following the same structure with the PPM in zhao2017pyramid, we also adapt it to fuse four different pyramid scale feature maps, as shown in the middle of Fig. 8. Note that the atrous convolution is used (kernel: , rate = 2, 4) in the encoder unit E4 and E5 when enabling PPM scheme, as shown in lower left part of the Fig. 8.

Appearance Consistency Design. To force our CAC-UNet to learn features that are invariant to appearance changes, such as stain colors, lesion structure styles, we add the instance normalization (IN) ulyanov2017improved function to our network like the proposed IBN block by work pan2018two.

As denoted by work pan2018two

, IN is stronger for learning appearance invariant features and batch normalization (BN) is essential for preserving content related information. Thus we also apply IN and BN at the same time. In the encoder part, we only integrate IBN in E2 to E4, in order to enhance the domain adaptability of the model. In the decoder part, we embed IN into all the residual blocks of the decoding units, D1 to D5. The detailed structure is depicted as the lower right part of the Fig.


Feature Fusion. To achieve higher segmentation accuracy, we conduct pixel-level mask prediction based on hypercolumn hariharan2015hypercolumns

. The hypercolumn at a pixel is defined as the vector of activations of all feature map units at the same pixel-level location. Thus, the hypercolumn can help address the problem that: it is too coarse just considering the information of the decoder output. We upsampled the features of last layer in the decoding units and concatenated them to obtain a hypercolumn, which is used to predict the final segmentation mask. Through this scheme, the multi-scale features including both the global semantic information and the precise localization information are fused.

Segmentation Loss Function. To the segmentation output of CAC-UNet backbone, we apply the segmentation loss as


where represents the Dice loss; and denotes an image patch and the corresponding segmentation mask, respectively.

3.4.4 Domain-adversarial Learning

Targets. Target 1: learn feature maps that are invariant to different domains ( and ), thus the segmentation network can robustly segment images from different domains. For this target, the encoder and decoder of the segmentation model are served as the generators ( and ). Target2: Put mask prior to the model, and make the generated masks more like the ground truth. For this target, the whole segmentation model is served as the generator (). To perform adversarial learning and realize the optimization of these generators, we need to design discriminators (, and ) correspondingly, which will be detailed in the following. Although these discriminators will not be used in the inference stage, they are vital for adversarial learning and the proper discriminator can help the model achieve the above two targets efficiently. All the discriminators are presented in Fig. 9.

Figure 9: The structures of three discriminator: , and . and are mirror models of the encoder and decoder, respectively. adopts the same structure with the encoder.

Discriminator and . For discriminator and , the first primary issue is to choose which layer’s feature maps in the segmentation network as the input of the discriminator. It is intuitive to select the feature maps of the last layer (such as the last layer of the encoder or decoder) because these feature maps contain more discriminative high-level semantic information. However, in kamnitsas2017unsupervised, they found that it is not ideal to only select feature maps of the last layer to adapt because the early layers are more susceptible to appearance variations between domains. In order to ensure that all feature maps to be concatenated and adapted, the authors in kamnitsas2017unsupervised crop large size feature maps to match the size of the last layer and then concatenate them. However, due to the cropping, a lot of information is lost. Moreover, the number of the directly concatenated feature maps is too huge, which is difficult for the discriminator training.

To avoid feature map information loss and decently adjust the weights of these features at the same time, we designed two mirror networks of the encoder and decoder as discriminator and , as shown in Fig. 9: the left and the right part. Thus, our discriminator () uses a similar network structure with the encoder. The key details of discriminator () are denoted as follows: (1) The first layer of discriminator takes the first layer’s feature maps of the encoder as the input directly; (2) The other layers of will generate same size feature maps as the the corresponding encoder layers, and then are sequentially concatenated to the feature maps from the corresponding encoder layer. The decoder discriminator () is constructed in the same manner. Through the proposed mirrored discriminators (, ), we can ingeniously solve the problem of the inconsistent size of different layers’ feature maps instead of cropping the feature maps roughly so that the discriminator is able to use different layers’ feature maps completely without any loss.

We adopt binary cross entropy as loss to update parameters of or . The losses of and are shown as Eq. (10) and Eq. (11),


where is the distribution of data , is the data distribution of , is the feature maps of encoder, and is the feature maps of decoder.

During adversarial training, the losses of and are shown as Eq. (12) and Eq. (13).


Discriminator . As we presented before, the expert labeled ground truth masks () are often with smooth and continuous boundaries. We use this prior to guide the prediction of the segmentation network, by introducing an additional mask loss. We adopt mask discriminator to achieve this. The structure of is depicted as the top part of Fig. 9.

It distinguishes the mask between the domain and the domain . By adversarial learning, it can make the output of the segmentation network as close as possible to the ground truth, thus making their boundaries similar. We also adopt binary cross entropy as the loss to update the parameters of , which is shown as Eq. (14).


During adversarial training, we adopt Eq. (15) as the loss of .


Training. With the images and labels in both and , we can train the segmentation network () and the discriminators (, , ) in a supervised way. In the training phase, we try to make segment more accurately by adopting feature maps invariant to variations between and . In the initial stage, we train with by minimizing , where is the collection of image patches randomly sampled from or , and is the collection of their label masks. After training for epochs, we start to train , , independently for epochs with the trained by minimizing , and . Then, we obtain a initial and initial , , which can initially classify, and we start adversarial training them alternately until convergence mei2020cross. In particular, we use as the loss of segmentation network instead of when training alternately.

4 Experiment

4.1 Dataset

We evaluate our method on colonoscopy tissue segment dataset of MICCAI 2019 Challenge DigestPath2019 challenge-grand-2019. The training set contains a total of 450 patients’ 750 tissue slices of an average size of . The fine pixel-level annotations of lesion and the diagnosis of the tissues are labeled by experienced pathologists. The testing set contains another 150 patients’ 250 tissues. All WSIs were stained by hematoxylin and eosin and scanned at X20. Note that the testing set is not released to the public to guarantee that the test data cannot be included in the training procedure. To train and verify our model, we split the 750 WSIs into two parts: 682 WSIs for training and 68 WSIs for validation.

Except for the DigestPath2019 dataset, we also employ another two pathology image datasets to help validate our technologies, such as the label smoothing and domain adaptation. The first dataset is built based on Camlyon16 dataset Camelyon2016, including 50,000 training patches, 50,000 validation patches with size cropped from the training set of Camlyon16, and 60 testing WSIs (37 normal and 23 tumor). The second dataset contains renal biopsy pathology images with size from clinical routines of one top-tier hospital in Beijing, which are stained with Periodic Schiff-Methenamine Silver (PASM) or Periodic acid-Schiff (PAS) method. The training set consists of 6708 patches (4324 PASM stained patches and 2384 PAS stained patches), and the testing set includes 1524 patches (912 PASM stained patches and 612 PAS stained patches). The aim of the second dataset is to build model segmenting the glomeruli from the renal biopsy pathology images.

4.2 Implementation

The proposed method for WSI classification is implemented with Python3.6 and Pytorch0.4.1 using an NVIDIA GeForce GTX 1080 Ti GPU. All the training patches are cropped from 682 WSIs, using a patch size of

and a stride of 512 pixels. We trained networks with standard back propagation, which is performed by stochastic gradient descent method (momentum = 0.9 with weight decay 0.0001, batch size = 64, constant learning rate = 0.001), and the models converge to its optimal accuracy within 5 epochs. The proposed segmentation network is implemented with the Pytorch 1.0 framework with a Tesla V100. We used the RAdam and Lookahead optimizer (initial learning rate is 0.001, momentum parameters

= 0.95, = 0.999, weight decay = 0.0005, batch size = 8) to update the parameters of the networks. All the training patches are cropped from 223 WSIs, using a patch size of and a stride of 512 pixels. During training, these patches are resized to . We first trained the segmentation network for 20 epochs and fine-tune it with the domain-adversarial learning for 5 epochs.

The other related parameters of this work are summarized in Table 1.

30 0.1 0.1 0.1 0.01 0.001 0.001
Table 1: Other related parameters.

4.3 Evaluation Criteria

The evaluation of the WSI classification and lesion segmentation follows the challenge rule222 Classification accuracy, recall and precision are also involved in the evaluation of Stage-2 patch models. Besides, we will briefly analyze the computing complexity for our scheme.

WSI classification: The WSI classification is evaluated by classification area under the curve (AUC). AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. AUC is denoted as


where and are the scores for a positive and a negative instance, respectively; represents true positive rate, and represents false positive rate.

Accuracy, Recall and Precision.

Accuracy is the ratio of the corrected predicted images to the whole pool of validation samples. Recall is the proportion of real positives cases that are correctly predicted positive. Conversely, precision indicates the proportion of predicted positive cases that are correctly real positives. The three evaluation metrics above are depicted as follows:


where ,, and denote the number of true positives(TP), false positives(FP), true negatives(TN) and false negatives(FN) respectively.

Lesion segmentation: The lesion segmentation is evaluated by Dice Similarity Coefficient (DSC). The Dice metric measures area overlap between segmentation results and ground truth annotations. DSC can be written as


where and denote the sets of foreground pixels in the annotation and the corresponding sets of foreground pixels in the predicted segmentation result, respectively.

Team DSC DSC Rank AUC AUC Rank Final Rank
kuanguang 0.8075 1 1.0000 1 1
zju_realdoctor 0.7789 5 1.0000 1 2
TTA_Lab 0.7878 3 0.9948 4 3
SJTU_MedicalCV 0.7928 2 0.9773 6 4
ustc_czw 0.7862 4 0.9784 5 5
chenpingjun 0.7197 8 0.9974 3 6
MCPRL_218 0.7397 7 0.9745 8 7
path_fitting 0.6794 10 0.9754 7 8
mirl_task2 0.7590 6 0.5164 13 9
Roselia 0.6920 9 0.8886 11 10
Table 2: DSC and AUC of the MICCAI 2019 Challenge on Digestive-System Pathological Detection and Segmentation. The results indicate that our approach outperforms other involved methods.

4.4 Tissue Segmentation and Classification Comparisons

On the unreleased test data of DigestPath2019, the final tissue segmentation and classification results are reported as Table 2333The challenge result can be found in Our method achieves the best DSC and AUC (AUC is the same as the zjurealdoctor team) at the same time, and we achieve the best Final Rank among all methods. This demonstrates that our model has strong generalization ability on the unknown test data. From Table 2, we can observe that most of the methods achieve high AUC for WSI-level classification, but fail to obtain decent DSC for segmentation. This indicates that the lesion segmentation task is more challenging compared to WSI classification.

LS Patch Acc. WSI AUC
10% - 0.97526 0.9990
5% - 0.99357 0.993
10% 0.98538 0.9981
5% 0.99470 1
Table 3: WSI AUC and patch-level accuracy of proposed methods in Stage-1 on our validation dataset.

4.5 WSI Classification

Total 29504 patches are sampled to train the patch-level model used for WSI classification, and 3557 patched are utilized as the validation set. As denoted before, 68 WSIs are used to evaluate the WSI-level classification. Note that we label a patch as the positive sample when the malignant area ratio is bigger than a threshold . In our experiment, we include two thresholds and . We also validate the proposed label smoothing (LS) under .

LS Patch Acc. WSI AUC
- 0.95234 0.9330
0.95618 0.9450
Table 4: WSI AUC and patch-level accuracy of proposed methods in Stage-1 on Camlyon16 dataset.

We summarize the WSI and patch classification results of Stage-1 in Table 3. Note that the patch level accuracy is listed just for showing the performance of our model in the collected patch-level validation set. For patch level accuracy, our proposed method achieves 0.97526, 0.99357, 0.98538 and 0.99470 corresponding to: 1) LS close and ; 2) LS close and ; 3) LS open and ; 4) LS open and , respectively. Smaller threshold can introduce higher patch accuracy under the same LS configuration, as denoted in Table 3. Under the same threshold , the LS can bring 1% accuracy increase, from 0.97526 to 0.98538, which validates the effectiveness of the proposed LS. Based on the patch level prediction results, the WSI classification is performed. Corresponding to the above four configurations, 0.9990, 0.993, 0.9981 and 1 WSI AUC are obtained on our validation set. We submitted the first three solutions (the fourth solution is conducted after the DigestPath2019 challenge) to the challenge, one of the solutions achieve WSI AUC 1 on the unknown testing dataset. Note that the observed testing phenomenon may be different between our defined validation set and the final testing set. These differences will not be discussed due to the unavailability of the testing set. We also tested our proposed LS scheme on Camlyon16 dataset, and listed the results in Table 4. On this dataset, our scheme can bring more than 1% WSI AUC gain (from 0.9330 to 0.9450), which further verifies the effectiveness of our method. It should be noted that we follow the same image patch labeling in work li2018cancer, where the ground truth label is determined by the center pixel label in the corresponding down-sampled patch, and thus is not applicable.

When the positive WSIs are detected in Stage-1, we apply 3 patch models which trained by another sampled set to conduct key positive patch selection. The patch-level classification result in Stage-2 are listed in Table 5, which denotes that our adopted multi-model voting scheme (Ensemble) obtains the best recall (0.9362), precision (0.9027) and accuracy (0.8935), respectively. This patch-level classification result is vital for the following segmentation. In our validation set, the recall reaches 0.9362 and the final segmentation result indicates this recall is sufficient.

Method Recall Precision Accuracy
ResNet 0.9293 0.8804 0.8721
DenseNet 0.9261 0.8986 0.8832
ResNeXt 0.9198 0.8765 0.8648
Ensemble 0.9362 0.9027 0.8935
Table 5: Recall, Precision and Accuracy of four methods: ResNet101, DenseNet161, ResNeXt101, and Ensemble. Ensemble means the voting result of adopted three models. Note that all the models are trained with patches cropped from positive WSIs.

4.6 Lesion Segmentation

Method Patch WSI
Work mejbri2019deep 0.8568 0.8120
Work tao2019highly 0.8505 0.7591
Ours 0.8749 0.8292
Table 6: Segmentation performance comparisons among work mejbri2019deep, work tao2019highly and our proposed method.

We compare our proposed method with another two WSI segmentation solutions: 1) directly performing segmentation on cropped patches based on UNet mejbri2019deep; 2) performing patch classification first and then segmenting the selected positive patches tao2019highly. Testing is presented on two validation datasets: 1) the sampled lesion patches from WSI; 2) the entire 68 validation WSIs. The lesion segmentation results are tabulated in Table 6. As presented in Table 6, our work achieves the best patch level and WSI level segmentation accuracy among all the listed methods. The reported accuracy of our work shows 1.74% and 1.7% accuracy gain than the second place schemes on patch and WSI datasets, respectively. We submit the proposed method to the challenge and achieve WSI DSC 0.8075 on the unknown testing dataset, which outperforms all the other methods.

Figure 10: Visual quality comparisons of image segmentation on four selected samples. From left column to right column: input images, ground truth, FCN, UNet, DeepLab, and our method.

Some visual results of the segmented images by various algorithms are presented in Fig. 10. Obviously, our proposed method generates the best perceptual results. The proposed method not only can segment out each positive areas but also preserves much finer texture details, showing much better smooth and continuous visual boundaries than basic UNet, FCN and DeepLab chen2017deeplab.

4.7 Ablation Study

We further discuss the contributions of each component in our scheme to analyze the performance gain in detail. The ablation study includes two parts: the multi-level detection architecture and the segmentation techniques.

Method 1 Level 2 Levels 3 Levels
FPN 0.7058 0.7940 0.8097
FCN 0.7550 0.7979 0.8177
Deeplab 0.7591 0.7947 0.8134
CAC-UNet 0.6663 0.7980 0.8292
Table 7: Performance comparisons of four models (FPN, FCN, DeepLab, and CAC-UNet) under three architecture configurations: 1 Level, 2 Levels and 3 Levels.

Multi-level detection architecture. In our proposed scheme, we adopt multi-level detection architecture: we first detect the positive WSIs (Stage-1), then detect the positive patches in positive WSIs (Stage-2), and finally segment the selected patches (Stage-3). Note in different stages, the proposed models and techniques are different. Four models are involved (FPN lin2017feature, FCN, Deeplab, and CAC-UNet) for testing under 3 configurations: 1) directly segmenting all the patches of each WSI (1 Level); selecting the key patches and then segmenting the key patches (2 Levels); selecting key WSIs first, then choose key patches and finally segmenting the key patches (3 Levels). The results show that all the models with 3 Levels of architecture achieve the best performance when compared with the 1 Level and 2 Levels configurations, which indicates that the multi-level (three levels) detection architecture is very important for the segmentation of WSI. We believe that for a WSI with large resolutions, taking multi-level architecture having more advantages than the 1 Level or 2 Levels scheme. For example, we do not need to segment a negative WSI at all. However, if we directly crop this WSI into many patches, and then perform segmentation for these patches (or the selected patches when using the architecture of 2 Levels) one by one with segmenting model. This will bring risk for the detection of many false positive areas. We are convinced that the multi-level detection can conduct better results by flexibly applying different level information. Note that in Table 7, our proposed CAC-UNet performs poor in 1 Level and 2 Levels configuration. The reason is that our model is trained based on the positive patches selected from the positive WSIs, and this model may fail to detect the negative WSIs or the negative patches of the positive WSIs. To improve the performance under architectures of 1 Level or 2 Levels, the proposed CAC-UNet should be retrained using proper samples.

UNet Aug. IBN Hypercolumn SCSE PPM Patch DSC WSI DSC
0.8490 0.8082
0.8568 0.8120
0.8628 0.8158
0.8646 0.8162
0.8706 0.8243
0.8726 0.8265
Table 8: Performance gains by gradually integrating the adopted techniques (Aug., IBN, Hypercolumn, SCSE, and PPM) to UNet.

Segmentation Techniques. In this ablation study, we demonstrate the effectiveness of different techniques used in our segmentation model. Table 8 shows performance gains when gradually adding the adopted techniques to UNet: data argumentation (Aug.), IBN, Hypercolumn, SCSE , PPM. The backbone of our proposed CAC-UNet achieves 0.8726 patch DSC and 0.8265 WSI DSC. We further analyze the performance of different discriminators based on CAC-UNet, and list the related results in Table 9. The results denote that each discriminator can further boost the patch-level and WSI-level performance. When integrating , and together, the segmentation model produces the best performance, which indicates that these discriminators have some complementary nature in pathology image segmentation.

Method Patch DSC WSI DSC
0.8730 0.8275
0.8731 0.8274
0.8731 0.8275
0.8749 0.8292
Table 9: Performance comparisons by using different discriminators.
Method Patch DSC WSI DSC
UNet 0.8513±0.0025 0.8096±0.0024
Work tellez2018whole 0.8521±0.0029 0.8102±0.0027
Ours () 0.8565±0.0034 0.8167±0.0031
Table 10: Performance comparisons between work tellez2018whole and our domain adaptation scheme with three discriminators on DigestPath2019 dataset. Both work tellez2018whole and our method are realized based on basic UNet.
Method Patch DSC
UNet 0.9074±0.0025
Work tsai2018learning 0.9112±0.0026
Work kamnitsas2017unsupervised 0.9159±0.0018
Our () 0.9174±0.0012
Our () 0.9166±0.0015
Our () 0.9153±0.0012
Our () 0.9187±0.0013
Table 11: Performance comparisons with work tsai2018learning, work kamnitsas2017unsupervised and our domain adaptation scheme on renal biopsy pathology dataset. The PASM stained patches and the PAS stained patches are used as the source and target domains, respectively. To make fair comparisons, all involved schemes are reproduced (tsai2018learning and kamnitsas2017unsupervised) or realized (our scheme) based on the basic UNet.
Figure 11: Performance verification of our proposed DA scheme on DigestPath2019 dataset. Visual quality comparisons of image segmentation on six selected samples. From left column to right column: input images, ground truth, UNet, and UNet+(++).
Figure 12: Performance verification of our proposed DA scheme on renal biopsy pathology dataset. Visual quality comparisons of image segmentation on six selected samples. From left column to right column: input images, ground truth, UNet, and UNet+(++).

The above Table 9 is our basic ablation experiment of adversarial learning scheme for image segmentation in DigestPath2019 challenge. To further verify the effectiveness of our method, we conduct two more experiments: (1) comparisons between our scheme and work tellez2018whole which achieves domain invariance by using stain augmentation; (2) comparative analysis of our method and strategies in work tsai2018learning and work kamnitsas2017unsupervised. Different from the above comparisons, all results listed in Table 10 and Table 11 are the averages of 10 trials to more stably analyze the performance of our DA.

From Table 10, we can see that our adversarial domain adaption method achieves better performance than work tellez2018whole for both patch and WSI DSC. It should also be noted that our scheme obtains higher gains in UNet than in the proposed backbone of CAC-UNet (see Table 9). Part of the reason for this phenomenon is that our proposed backbone of CAC-UNet has already integrated some techniques that can learn appearance invariant features, such as the IBN block. On the renal biopsy pathology dataset, our scheme introduces about 0.0113 patch DSC gain (from 0.9074 to 0.9187) than the basic UNet, which shows superior results than the recent domain adaption work tsai2018learning and work kamnitsas2017unsupervised. We also verify our proposed DA scheme through visual quality comparisons on both DigestPath2019 and renal biopsy pathology dataset, as depicted by Fig. 11 and Fig. 12. The visual quality comparisons also denote that our DA scheme can bring higher performance gain on the renal biopsy pathology dataset, which is similar to the objective performance in Table 10 and Table 11.

4.8 Computing Complexity Analysis

On our 68 validation WSIs with size from 2371 1792 to 11246 23473, the average processing time is 15.3s, which is far below than the upper limit (120s) required by the challenge. In Fig. 13, we depict the processing time for 22 pairs of WSIs. Each pair contains a positive WSI and a negative WSI, and the WSIs in the same pair have similar resolutions. We sort the WSI pairs in ascending order by their resolutions. As shown by Fig. 13, with the increase of WSI resolutions, the processing time of positive WSIs (the red points) increases correspondingly with fast speed. However, the processing time of the negative WSIs (the black points) just rise slightly, and this is because most of the WSIs are dropped in Stage-1 and will not be processed in Stage-2 and Stage-3. Thus much processing time can be saved by the negative WSIs and then our scheme can focus on the processing of important positive WSIs.

Figure 13: Processing time of selected 22 WSI pairs (44 WSIs). The red and black points denote the positive WSIs and negative WSIs, respectively.

5 Discussion

Automatic and objective medical diagnostic model can be valuable to achieve early cancer detection and diagnosis based on different pathological WSIs, and thus can reduce the mortality rate. The existing method directly apply the same patch-level model to perform both WSI-level and patch-level classification is hard to balance both tasks at the same time. Besides, the existing segmentation models are lack of the ability of appearance invariant. In this study, we proposed a highly efficient multi-level malignant tissue detection architecture, and designed an adversarial CAC-UNet to achieve robust appearance-invariant segmentation.

Figure 14: False positive area illustration. Performing UNet on a cropped negative WSI, the red part denotes the detected false positive areas.

We found that our proposed scheme achieves the best performance on DigestPath2019 colonoscopy tissue segmentation and classification task (see Table 2), indicating the effectiveness of the proposed multi-level colonoscopy malignant tissue detection by using the designed adversarial CAC-UNet. Specifically, our proposed three-level detection can conduct better results than one-level and two-level architecture (see Table 7). This architecture can lower the risk of predicting many false positive areas. Fig. 14 shows many false positive areas in a cropped negative WSI without any malignant tissue by using UNet, which confirms that the one-level architecture suffers false positive area detection. We also show that our proposed CRC-UNet backbone and the adversarial scheme can accurately conduct the tissue segmentation (see Table 6, Table 8 and Table 9).

Our results provide compelling performance for colonoscopy malignant tissue detection through the proposed multi-level adversarial CAC-UNet. However, some limitations are worth noting. The key patch selection scheme of Stage-2 performs mediocrely on the validation dataset, which needs further improvement in the future. Besides, in this work we just divide the training set into two domains, however multiple domains should be studied for further increasing the generalization ability. Future work should therefore include to design stronger DA scheme. We release our codes in the GitHub to support the possible interested discussion.

6 Conclusion

We have presented a multi-level colonoscopy malignant tissue detection based on the proposed adversarial CAC-UNet, and we have shown that the proposed detection architecture with our segmentation model achieve superior results. The promising results and designed algorithms can be applied to automatic diagnosis scenario.

7 Acknowledgement

This work was supported in part by the Beijing Natural Science Foundation (4182044 and 7202056), the National Natural Science Foundation of China (81972248). This work is conducted on the platform of Center for Data Science of Beijing University of Posts and Telecommunications.