CELNet: Evidence Localization for Pathology Images using Weakly Supervised Learning

09/16/2019 ∙ by Yongxiang Huang, et al. ∙ The Hong Kong University of Science and Technology

Although deep convolutional neural networks boost the performance of image classification and segmentation in digital pathology analysis, they are usually weak in interpretability for clinical applications or require heavy annotations to achieve object localization. To overcome this problem, we propose a weakly supervised learning-based approach that can effectively learn to localize the discriminative evidence for a diagnostic label from weakly labeled training data. Experimental results show that our proposed method can reliably pinpoint the location of cancerous evidence supporting the decision of interest, while still achieving competitive performance on glimpse-level and slide-level histopathologic cancer detection tasks.




1 Introduction

Pathology analysis based on microscopic images is a critical task in medical image computing. In recent years, deep learning on digitized pathology slides has facilitated progress in automating many diagnostic tasks, offering the potential to increase accuracy and improve review efficiency. Limited by computation resources, deep learning-based approaches on whole slide pathology images (WSIs) usually train convolutional neural networks (CNNs) on patches extracted from WSIs and aggregate the patch-level predictions to obtain a slide-level representation, which is further used to identify cancer metastases and stage cancer [Wang2016DeepLF]. Such a patch-based CNN approach has been shown to surpass pathologists in various diagnostic tasks [Liu2017DetectingCM].

Off-the-shelf CNNs have been shown to accurately classify or segment pathology images into different diagnostic types in recent studies [huang2018improving, veeling2018rotation]. However, most of these methods are weak in interpretability, especially for clinicians, due to a lack of evidence supporting the decision of interest. During diagnosis, a pathologist often inspects abnormal structures (e.g., large nucleus, hypercellularity) as the evidence for determining whether the glimpsed patch is cancerous. For CAD systems, learning to pinpoint the discriminative evidence can provide precise visual assistance for clinicians. Strong supervision-based feature localization methods require a large number of pathology images annotated at the pixel level or object level; such annotations are very costly and time-consuming to obtain and can be biased by the experience of the observers. In this paper, we propose a weakly supervised learning (WSL) method that can learn to localize the discriminative evidence for the class of interest on pathology images from weakly labeled (i.e., image-level) training data. Our contributions include: i) proposing a new CNN architecture with multi-branch attention modules and a deep supervision mechanism, to address the difficulty of localizing discrete and small objects in pathology images, ii) formulating a generalizable approach that leverages the gradient-weighted class activation map and the saliency map in a complementary way to provide accurate evidence localization, iii) designing a new attention module that captures spatial attention from various contexts, iv) quantitatively and visually evaluating WSL methods on large-scale histopathology datasets, and v) constructing a new dataset (HPLOC) based on Camelyon16 for effectively evaluating evidence localization performance on histopathology images.

Related Work. Recent studies have demonstrated that a CNN can learn to localize discriminative features even when it is trained on image-level annotations [zhou2016learning]. However, these methods are evaluated on natural image datasets (e.g., PASCAL), where the objects of interest are usually large and distinct in color and shape. In contrast, objects in pathology images are usually small and less distinct in morphology between different classes. A few recent studies investigated WSL approaches on medical images, including lung nodule detection and placental ultrasound images [feng2017discriminative]. These methods employ GAP-based class activation maps and require CNNs ending with global average pooling, which degrades the performance of CNNs as a side effect [zhou2016learning].

2 Methods

The framework is summarized in the overview figure. The model is trained to predict the cancer score for a given image, indicating the presence of cancer metastasis. In the test phase, besides giving a binary classification, the model generates a cancerous evidence localization map and performs localization.

2.1 Cancerous Evidence Localization Networks (CELNet)

Given that the object of interest is relatively small and discrete, a moderate number of convolutional layers is sufficient for encoding locally discriminative features. As discussed in Section 1, instances on pathology images are similar in morphology and can be densely distributed, so the model should avoid over-downsampling in order to pinpoint the cancerous evidence among the densely distributed instances. The proposed CELNet starts with a convolution head followed by 3 Multi-branch Attention-based Residual Modules (MA-ResModule) [he2016deep]. (A densely connected module is not employed because its dense tensor concatenation makes it comparatively slow for WSI applications.) Each MA-ResModule is composed of 3 consecutive building blocks integrated with the proposed attention module (MAM), as shown in the framework overview figure (right). For downsampling in residual connections, we use a convolution with a stride of 2 whose kernel size is chosen to reduce information loss. Batch normalization and ReLU are applied after each convolution layer for regularization and non-linearity.

2.1.1 Multi-branch Attention Module (MAM)

To eliminate the effect of background contents and focus on representing the cancerous evidence (which can be sparse), we employ an attention mechanism. Improving on the Convolutional Block Attention Module (CBAM) [woo2018cbam], which extracts channel attention and spatial attention of an input feature map in a squeeze-and-excitation manner, we propose a multi-branch attention module (MAM). MAM can better approximate the importance of each location on the feature map by looking at its context at different scales. Given a squeezed feature map generated by the channel attention module, we derive a 2D spatial attention map by applying convolution operations with several kernel sizes, one per branch, followed by the sigmoid function. We use 3 branches in our experiments. The feature map is then refined by element-wise multiplication with the spatial attention map.

MAM is conceptually simple but effective in improving detection and localization performance as demonstrated in our experiments.
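The multi-branch idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the kernel sizes (3, 5, 7), the summation of branch responses before the sigmoid, the random kernel weights, and the channel mean used as a stand-in for the squeezed feature map are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive 2D cross-correlation of a single-channel map, zero padded to keep its size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multi_branch_spatial_attention(squeezed, kernels):
    """One convolution per branch; branch responses are summed (assumption), then squashed."""
    response = sum(conv2d_same(squeezed, k) for k in kernels)
    return sigmoid(response)

def refine(features, attention):
    """Element-wise multiplication of a (C, H, W) feature map with an (H, W) attention map."""
    return features * attention[None, :, :]

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))
squeezed = features.mean(axis=0)  # stand-in for the channel-attention output (assumption)
kernels = [rng.normal(scale=0.1, size=(k, k)) for k in (3, 5, 7)]  # hypothetical kernel sizes
attention = multi_branch_spatial_attention(squeezed, kernels)
refined = refine(features, attention)
```

Because every attention value lies strictly between 0 and 1, the refinement can only attenuate activations, which is how background locations are suppressed in this sketch.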

2.1.2 Deep Supervision

Deep supervision [lee2015deeply] is employed to empower the intermediate layers to learn class-discriminative representations, so that the cancer activation map can be built at a higher resolution. We achieve this by adding two companion output layers to the last two MA-ResModules, as shown in the framework overview figure. Global max pooling (GMP) is applied to search for the most discriminative features spatially, while global average pooling (GAP) is applied to encourage the network to identify all discriminative parts of the image. Each companion output layer applies GAP and GMP on the input feature map and concatenates the resulting vectors. The cancer score of the input image is derived by concatenating the outputs of the two companion layers, followed by a fully convolutional layer (i.e., kernel size $1 \times 1$) with a sigmoid activation. CELNet enjoys efficient inference when applied to test WSIs, as it is fully convolutional and avoids repetitive computation for the overlapping parts between neighboring patches.
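The companion output layers can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, and the final layer is written as a plain linear map, which is what a convolution reduces to when its input has a spatial size of 1 × 1.

```python
import numpy as np

def companion_output(feature_map):
    """Map a (C, H, W) feature map to a vector of length 2C: GAP values, then GMP values."""
    gap = feature_map.mean(axis=(1, 2))  # average response of every channel
    gmp = feature_map.max(axis=(1, 2))   # strongest response of every channel
    return np.concatenate([gap, gmp])

def cancer_score(companion_a, companion_b, weights, bias):
    """Concatenate both companion vectors, apply a linear map and a sigmoid."""
    z = np.concatenate([companion_a, companion_b])
    logit = float(weights @ z + bias)
    return 1.0 / (1.0 + np.exp(-logit))

# Toy feature maps standing in for the outputs of the last two modules.
fm_a = np.arange(8, dtype=float).reshape(2, 2, 2)
fm_b = np.ones((3, 1, 1))
vec_a = companion_output(fm_a)
vec_b = companion_output(fm_b)
score = cancer_score(vec_a, vec_b, weights=np.zeros(10), bias=0.0)
```

With zero weights the score is exactly 0.5, a quick sanity check that the sigmoid is applied to the logit and not to the pooled features.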

2.2 Cancerous Evidence Localization Map (CELM)

2.2.1 Cancer Activation Map (CAM)

Given an image $I$, let $S_c(I)$ represent the cancer score function governed by the trained CELNet (before the sigmoid layer). A cancer-class activation map shows the importance of each region on the image to the diagnostic value. For a target layer $l$, the CAM is derived by taking the weighted sum of the feature maps $A^l$ with the weights $\{w_k^l\}$, where $w_k^l$ represents the importance of the $k$-th feature plane. The weights are computed as $w_k^l = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial S_c}{\partial A_k^l(i,j)}$, where $Z$ is the number of spatial positions, i.e., by spatially averaging the gradients of the cancer score $S_c$ with respect to the $k$-th feature plane $A_k^l$, which is achieved by back propagation (see the framework overview figure). Thus, the CAM of layer $l$ can be derived by $\mathrm{CAM}_l = \mathrm{ReLU}\left(\sum_k w_k^l A_k^l\right)$, where ReLU is applied to exclude the features with negative influence on the class of interest [selvaraju2017grad].

We derive two CAMs, CAM2 and CAM3, from the last layers of the second and the third residual modules of CELNet, respectively. CAM3 represents discriminative regions for identifying a cancer class at a relatively low resolution, while CAM2 has a higher resolution and is still class-discriminative under deep supervision.
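The gradient-weighted activation map described above amounts to three array operations. The sketch below assumes the feature maps and the gradients of the cancer score with respect to them have already been obtained from some framework; the function name and the toy values are hypothetical.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Compute a gradient-weighted activation map.

    feature_maps, gradients: arrays of shape (K, H, W), where gradients holds the
    derivative of the cancer score with respect to every feature-map entry.
    """
    weights = gradients.mean(axis=(1, 2))                       # spatial average per feature plane
    weighted_sum = np.tensordot(weights, feature_maps, axes=1)  # sum over k of w_k * A_k
    return np.maximum(weighted_sum, 0.0)                        # ReLU drops negative influence

feature_maps = np.array([[[1.0, 2.0], [3.0, 4.0]],
                         [[1.0, 1.0], [1.0, 1.0]]])
gradients = np.array([[[0.5, 0.5], [0.5, 0.5]],
                      [[-1.0, -1.0], [-1.0, -1.0]]])
cam = grad_cam(feature_maps, gradients)
```

In this toy case the weights are 0.5 and -1, so the weighted sum is `[[-0.5, 0], [0.5, 1]]` and the ReLU zeroes the negative entry.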

2.2.2 Cancer Saliency Map (CSM)

In contrast with the CAM, the cancer-class saliency map shows the contribution of each pixel site to the cancer score $S_c(I)$ of the input image $I$. This can be approximated by the derivative of a linear function $S_c(I) \approx w^{\top} I + b$. Thus the pixel contribution is computed as $w = \partial S_c / \partial I$. Different from [Simonyan2013DeepIC], we derive $w$ by guided back-propagation [springenberg2014striving] to prevent the backward flow of negative gradients. For an RGB image, to obtain its cancer saliency map from $w$, we first normalize $w$ to a fixed range, followed by greyscale conversion and Gaussian smoothing, instead of simply taking the maximum magnitude of $w$ as proposed in [Simonyan2013DeepIC]. Thus, the resulting cancer saliency map (see Fig. 1 (b)) is far less noisy and more focused on class-related objects than the original one proposed in [Simonyan2013DeepIC].
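The post-processing of the gradient into a saliency map can be sketched as follows. The sketch starts from an already computed gradient (the guided back-propagation step is not shown) and makes several assumptions that the text does not specify: normalization to [0, 1], greyscale conversion by channel averaging, and a Gaussian width of `sigma=1.0`. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(image, sigma=1.0):
    """Separable Gaussian smoothing with edge padding; output has the input's shape."""
    radius = int(3 * sigma)
    k = gaussian_kernel1d(sigma, radius)
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def saliency_from_gradient(grad_rgb, sigma=1.0):
    """Turn an (H, W, 3) gradient into an (H, W) saliency map."""
    lo, hi = grad_rgb.min(), grad_rgb.max()
    normalized = (grad_rgb - lo) / (hi - lo + 1e-12)  # assumed target range: [0, 1]
    grey = normalized.mean(axis=2)                    # assumed greyscale conversion
    return smooth(grey, sigma)

rng = np.random.default_rng(0)
grad = rng.normal(size=(16, 16, 3))  # placeholder for a guided back-propagation gradient
csm = saliency_from_gradient(grad, sigma=1.0)
```

Smoothing after the greyscale conversion is what removes the pixel-level speckle that a raw gradient magnitude typically shows.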

2.2.3 Complementary Fusion

The generated CAMs coarsely display discriminative regions for identifying a cancer class (see Fig. 1 (c)), while the CSM is fine-grained and sensitive, and represents pixel-level contributions to the identification (see Fig. 1 (b)). To combine their merits for precise cancerous evidence localization, we propose a complementary fusion method. First, CAM3 and CAM2 are combined into a unified cancer activation map: CAM3 is upsampled by bilinear interpolation, and the two maps are weighted by a coefficient in the range [0,1] that is chosen by validation. The CELM is then derived by complementarily fusing the CSM and the CAM through their element-wise product, weighted by a coefficient that captures the reliability of the point-wise multiplication of CAM and CSM; its value is estimated by cross-validation in experiments.
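One plausible form of the fusion is sketched below. The exact formulas are not recoverable from the text here, so both the convex combination of the two CAMs (coefficient `alpha`) and the blend between the CAM and its product with the CSM (coefficient `beta`) are assumptions, as are the function names and the corner-aligned interpolation.

```python
import numpy as np

def upsample_bilinear(x, out_h, out_w):
    """Bilinear upsampling of a 2D map with aligned corners."""
    in_h, in_w = x.shape
    rows = np.linspace(0, in_h - 1, out_h)
    cols = np.linspace(0, in_w - 1, out_w)
    # Interpolate along the width first, then along the height.
    tmp = np.stack([np.interp(cols, np.arange(in_w), r) for r in x])
    return np.stack([np.interp(rows, np.arange(in_h), c) for c in tmp.T], axis=1)

def fuse_cams(cam3, cam2, alpha):
    """Assumed form: alpha * upsample(CAM3) + (1 - alpha) * CAM2."""
    up = upsample_bilinear(cam3, *cam2.shape)
    return alpha * up + (1.0 - alpha) * cam2

def celm(cam, csm, beta):
    """Assumed form: beta * (CAM * CSM) + (1 - beta) * CAM, at the CSM resolution."""
    cam_full = upsample_bilinear(cam, *csm.shape)
    return beta * cam_full * csm + (1.0 - beta) * cam_full

cam3 = np.array([[0.0, 1.0], [2.0, 3.0]])
cam2 = np.zeros((3, 3))
up3 = upsample_bilinear(cam3, 3, 3)
unified = fuse_cams(cam3, cam2, alpha=1.0)
localization = celm(unified, np.ones((6, 6)), beta=0.5)
```

Under these assumptions, `beta = 0` falls back to the upsampled CAM alone, while larger values let the fine-grained CSM sharpen the coarse activation regions.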

3 Experiments & Results

We first evaluate the detection performance of the proposed model, as required for clinical use, and then evaluate evidence localization.

3.1 Datasets and Experimental Setup

The detection performance of the proposed method is validated on two benchmark datasets, PCam [veeling2018rotation] and Camelyon16 (https://camelyon16.grand-challenge.org).

PCam: The PCam dataset contains 327,680 lymph node histopathology images of size 96 × 96 pixels with binary class labels indicating the presence of cancer metastasis, split into 75% for training, 12.5% for validation, and 12.5% for testing as originally proposed. The class distribution in each split is balanced (1:1). For a fair comparison, following [veeling2018rotation], we perform image augmentation by random 90-degree rotations and horizontal flipping during training.

Camelyon16: The Camelyon16 dataset includes 270 H&E stained WSIs (160 normal and 110 cancerous cases) for training and 129 WSIs held out for testing (80 normal and 49 cancerous cases), where regions with cancer metastasis are delineated in cancerous slides. To apply our CELNet on WSIs, we follow the pipeline proposed in [Liu2017DetectingCM], including WSI pre-processing, patch sampling and augmentation, heatmap generation, and slide-level detection tasks. For slide-level classification, we take the maximum tumor score among all patches as the final slide-level prediction. For tumor region localization, we apply a non-maximum suppression algorithm on the tumor probability map aggregated from patch predictions to iteratively extract tumor region coordinates. Given the available computation resources, we work on the WSI data at 10× resolution instead of 40×.
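The two slide-level steps can be sketched on a toy probability map. The suppression radius and the stopping threshold below are hypothetical parameters chosen for illustration; the text does not state the values used, and the function names are not from the original pipeline.

```python
import numpy as np

def slide_score(prob_map):
    """Slide-level prediction: the maximum tumor score among all patches."""
    return float(prob_map.max())

def extract_tumor_coordinates(prob_map, radius, threshold):
    """Iteratively pick the highest remaining probability and suppress its neighborhood."""
    remaining = prob_map.astype(float).copy()
    detections = []
    while True:
        r, c = np.unravel_index(np.argmax(remaining), remaining.shape)
        p = remaining[r, c]
        if p < threshold:
            break
        detections.append((int(r), int(c), float(p)))
        r0, r1 = max(0, r - radius), r + radius + 1
        c0, c1 = max(0, c - radius), c + radius + 1
        remaining[r0:r1, c0:c1] = -np.inf  # suppress the square window around the pick
    return detections

prob_map = np.zeros((5, 5))
prob_map[1, 1] = 0.9
prob_map[1, 2] = 0.8  # neighbor of the strongest response, expected to be suppressed
prob_map[4, 4] = 0.7
score = slide_score(prob_map)
detections = extract_tumor_coordinates(prob_map, radius=1, threshold=0.5)
```

The loop always terminates because every pass removes at least the selected cell from further consideration.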

In our experiments, all models are trained using binary cross-entropy loss with L2 regularization to improve model generalizability, and optimized by SGD with Nesterov momentum of 0.9 and a batch size of 64 for 100 epochs. The learning rate is halved at 50 and 75 epochs. We select the model weights with minimum validation loss for test evaluation.
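The step schedule for the learning rate can be written as a small pure-Python function. The initial learning rate is left as a parameter because its value is not given here, and the convention that epochs are counted from 0 with the drop taking effect at the start of the milestone epoch is an assumption.

```python
def learning_rate(epoch, initial_lr, milestones=(50, 75), factor=0.5):
    """Return the learning rate for a 0-based epoch index under a step schedule."""
    drops = sum(1 for m in milestones if epoch >= m)  # milestones already reached
    return initial_lr * factor ** drops

# Rates relative to an initial value of 1.0 across the 100 training epochs.
schedule = [learning_rate(e, 1.0) for e in range(100)]
```

Here `schedule[49]` is 1.0, `schedule[50]` is 0.5, and every epoch from 75 onward uses 0.25.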

3.2 Classification Results

As Tbl. 1 shows, CELNet consistently outperforms ResNet, DenseNet, and P4M-DenseNet [veeling2018rotation] in histopathologic cancer detection on the PCam dataset. P4M-DenseNet uses fewer parameters due to parameter sharing under p4m-equivariance. As auxiliary experiments, we perform ablation studies and visual analysis. From Tbl. 1, we observe that our attention module brings a 1.77% accuracy gain (91.87 vs. 90.10), which is larger than the 0.76% gain brought by CBAM [woo2018cbam] (90.86 vs. 90.10). Both the CAM and the CELM on CELNet are mainly activated for the cancerous regions (see Fig. 1 (c) and (d)). These subfigures indicate that CELNet is effective in extracting discriminative evidence for histopathologic classification.

| Methods | Acc (%) | AUC (%) | #Params |
|---|---|---|---|
| ResNet18 [he2016deep] | 88.73 | 95.36 | 11.2M |
| DenseNet [veeling2018rotation] | 87.20 | 94.60 | 902K |
| P4M-DenseNet | 89.80 | 96.30 | 119K |
| CELNet | 91.87 | 97.72 | 297K |
| CELNet w/o MAM | 90.10 | 96.45 | 292K |
| CELNet w/o MAM + CBAM | 90.86 | 97.17 | 296K |

Table 1: Quantitative comparisons on the PCam test set. P4M-DenseNet [veeling2018rotation]: current SoTA method for the PCam benchmark, CELNet: our method, w/o MAM: removal of the proposed multi-branch attention module, +CBAM: integration with convolutional block attention module [woo2018cbam].

On slide-level detection tasks, as shown in Tbl. 2, our CELNet-based approach achieves higher classification performance (1.7%) in terms of AUC than the baseline method [Liu2017DetectingCM], and outperforms previous state-of-the-art methods in slide-level tumor localization performance in terms of FROC score. The results illustrate that, instead of using off-the-shelf CNNs as the core patch-level model for histopathologic slide detection, adopting CELNet can potentially bring a larger performance gain. CELNet is also more parameter-efficient, as shown in Tbl. 1, and testing a slide on Camelyon16 takes about 2 minutes on an Nvidia 1080Ti GPU.

| Methods | AUC (%) | FROC (%) |
|---|---|---|
| P4M-DenseNet | - | 84.0 |
| Liu [Liu2017DetectingCM] | 96.5 | 79.3 |
| Challenge Winner* [Wang2016DeepLF] | 99.4 | 80.7 |
| Pathologist | 96.6 | 73.3 |
| CELNet | 97.2 | 84.8 |

Table 2: Quantitative comparisons of slide-level classification performance (AUC) and slide-level tumor localization performance (FROC) on the Camelyon16 test set. *: The Challenge Winner uses 40× resolution while results of other methods are based on 10×.

3.3 Weakly Supervised Localization and Results

Given that the trained CELNet can precisely classify a pathology image, here we aim to investigate its performance in localizing the supporting evidence based on the proposed CELM. To achieve this, based on Camelyon16, we first construct a dataset with region-level annotations for cancer metastasis, namely HPLOC, and develop the metrics for measuring localization performance on HPLOC.

HPLOC: The HPLOC dataset contains 20,000 images with segmentation masks for cancerous regions. Each image is sampled from the test set of Camelyon16 and contains both cancerous regions and normal tissue in the glimpse, so HPLOC inherits the high quality of the Camelyon16 dataset.

Metrics: To perform localization, we generate segmentation masks from CELM/CAM/CSM by thresholding and smoothing (see Fig. 1 (e)). If a segmentation mask intersects with the cancerous region by at least 75%, it is defined as a true positive (the annotated contour in Camelyon16 is usually enlarged to surround all tumors). Otherwise, if a segmentation mask intersects with the normal region by at least 75%, it is considered a false positive. Thus, we can use precision and recall scores to quantitatively assess the localization performance of different WSL methods; the results are summarized in Tbl. 3.
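The 75% rule can be sketched as follows. The sketch assumes that "intersects by at least 75%" means that at least 75% of the predicted mask's pixels fall inside the respective region, and it takes the number of false negatives as an input because their definition is not spelled out here. All names are hypothetical.

```python
import numpy as np

def classify_mask(pred_mask, tumor_mask, min_overlap=0.75):
    """Label a predicted mask as 'tp', 'fp', or None when neither rule applies."""
    area = pred_mask.sum()
    if area == 0:
        return None
    tumor_fraction = np.logical_and(pred_mask, tumor_mask).sum() / area
    if tumor_fraction >= min_overlap:
        return "tp"
    if 1.0 - tumor_fraction >= min_overlap:  # share of the mask lying on normal tissue
        return "fp"
    return None

def precision_recall(tp, fp, fn):
    """Precision and recall from counts; fn must be supplied by the caller."""
    return tp / (tp + fp), tp / (tp + fn)

pred = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:2] = True                   # predicted mask of 4 pixels
mostly_tumor = pred.copy()
mostly_tumor[1, 1] = False              # 3 of the 4 predicted pixels are tumor
mostly_normal = np.zeros((4, 4), dtype=bool)
mostly_normal[0, 0] = True              # 1 of the 4 predicted pixels is tumor
half_tumor = np.zeros((4, 4), dtype=bool)
half_tumor[0, 0:2] = True               # 2 of the 4 predicted pixels are tumor
labels = [classify_mask(pred, m) for m in (mostly_tumor, mostly_normal, half_tumor)]
```

Masks that straddle the boundary, such as the half-and-half case, are counted as neither a true nor a false positive under this reading.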


| Methods | Precision (%) | Recall (%) |
|---|---|---|
| ResNet18 + Backprop [Simonyan2013DeepIC] | 79.8 | 85.5 |
| ResNet18 + GradCAM [selvaraju2017grad] | 85.6 | 82.4 |
| Ours | 91.6 | 87.3 |
| Ours w/o MAM | 88.1 | 85.6 |
| Ours w/o DS | 90.5 | 87.7 |
| CELNet + GradCAM | 91.0 | 85.4 |

Table 3: Quantitative comparisons for different weakly supervised localization methods on the HPLOC dataset. Ours: CELNet + CELM. MAM and DS are short for multi-branch attention module and deep supervision respectively.

Figure 1: Evidence localization results of our WSL method on the HPLOC dataset. (a) Input glimpse, (b) CSM: Cancer Saliency Map, (c) CAM: Cancer Activation Map, (d) CELM: Cancerous Evidence Localization Map, (e) Localization results based on CELM, where the localized evidence is highlighted for providing visual assistance, (f) GT: ground truth, where white masks represent tumor regions and black represents normal tissue.

We observe that our WSL method based on CELNet and CELM consistently performs better than the back propagation-based approach [Simonyan2013DeepIC] and the class activation map-based approach [selvaraju2017grad]. Note that we used ResNet18 [he2016deep] as the backbone for the compared methods because it achieves better classification performance and provides a higher resolution for GradCAM (12 × 12) as compared to DenseNet (3 × 3) [veeling2018rotation]. We perform ablation studies to further evaluate the key components of our method in Tbl. 3. The proposed multi-branch attention module increases the localization accuracy: removing it lowers both precision and recall. The deep supervision mechanism improves the precision of localization despite a slightly lower recall score, which can be caused by its regularization effect on the intermediate layers, that is, encouraging the learning of discriminative features for classification but also potentially discouraging the learning of some low-level histological patterns. We also observe that using CELM improves both recall and precision, which indicates that CELM allows better discovery of cancerous evidence than GradCAM. The visualization results are presented in Fig. 1. The cancerous evidence appears as large nuclei and hypercellularity in the images, which are precisely captured by the CELM. Fig. 1 (e) visualizes the localization results by overlaying the segmentation mask generated from CELM onto the input image, which demonstrates the effectiveness of our WSL method in localizing cancerous evidence.

4 Discussion & Conclusions

In this paper, we have proposed a generalizable method for localizing cancerous evidence on histopathology images. Unlike conventional feature-based approaches, the proposed method does not rely on specific feature descriptors but learns discriminative features for localization from the data. To the best of our knowledge, investigating weakly supervised CNNs for cancerous evidence localization and quantitatively evaluating them on large datasets have not previously been performed on histopathology images. Experimental results show that our proposed method can achieve competitive classification performance on histopathologic cancer detection and, more importantly, provide reliable and accurate cancerous evidence localization using weakly labeled training data, which reduces the burden of annotation. We believe that such an extendable method can have a great impact on detection-based studies of microscopy images and help improve the accuracy and interpretability of current deep learning-based pathology analysis systems.